[00:03.800 --> 00:06.200]  Are we ready to go?
[00:07.320 --> 00:08.580]  We are.
[00:08.700 --> 00:09.740]  Perfect.
[00:10.340 --> 00:14.280]  Hello and welcome to Machine Learning for Security Analysts.
[00:14.280 --> 00:19.200]  This is our first workshop here at the AI Village, DEF CON Safe Mode.
[00:19.620 --> 00:24.520]  So the goal here is to kind of take machine learning out of the buzzword and into the mainstream.
[00:25.200 --> 00:30.560]  So buzzwords like machine learning and AI have been in the industry for a couple of years now,
[00:30.560 --> 00:36.620]  but for most people, it's not really well understood what these actually mean.
[00:36.900 --> 00:39.660]  So my goal is to kind of introduce you to machine learning,
[00:39.660 --> 00:43.880]  and we're going to do this by kind of going over what machine learning is,
[00:43.880 --> 00:46.860]  how it fits in the broader concept of AI,
[00:47.080 --> 00:52.340]  and why it's important for security analysts and hackers to understand machine learning,
[00:52.340 --> 00:56.000]  as well as what the machine learning process looks like.
[00:56.000 --> 01:01.340]  And then we're going to take all of that and put that into building our first machine learning model,
[01:01.340 --> 01:03.380]  which will be something very familiar.
[01:03.820 --> 01:06.020]  So let's go ahead and get started.
[01:07.600 --> 01:11.940]  So who am I? My name is Gavin. I go by the pseudonym GTKlondike.
[01:11.940 --> 01:15.220]  I'm an independent security researcher and a security consultant.
[01:15.220 --> 01:19.980]  I have a passion for network security, both attack and defense.
[01:20.080 --> 01:23.880]  And so through that passion, I run a project called NetSec Explained,
[01:23.880 --> 01:29.420]  which is a blog and a YouTube channel where I use intermediate and advanced-level concepts
[01:29.420 --> 01:33.020]  and try to explain them in a very easy-to-understand way.
[01:35.260 --> 01:39.600]  So it's usually at this point where, in person, I would kind of ask the audience,
[01:39.600 --> 01:43.100]  when you hear the word machine learning, what comes to mind?
[01:43.260 --> 01:46.660]  Some of the common answers that I get are,
[01:46.660 --> 01:52.220]  oh, you need to have a PhD in multivariable calculus and linear algebra,
[01:52.220 --> 01:58.580]  or Skynet's going to take over the world, or pattern matching.
[01:59.980 --> 02:03.460]  So these are all pretty interesting answers,
[02:03.460 --> 02:06.500]  but what we're going to walk through in this workshop is kind of show you that
[02:06.500 --> 02:08.600]  not all of that is necessary.
[02:08.600 --> 02:12.200]  You don't need a PhD in multivariable calculus or linear algebra,
[02:12.200 --> 02:19.520]  especially with some of the abstraction libraries like scikit-learn, Keras, TensorFlow, and things like that.
[02:19.520 --> 02:24.880]  And then as far as machines taking over the world, not quite.
[02:27.140 --> 02:32.300]  So machine learning and how it fits in the broader idea of artificial intelligence.
[02:32.300 --> 02:37.200]  So artificial intelligence is this big umbrella term that kind of is a catch-all
[02:37.200 --> 02:41.360]  for different types of ways that machines operate.
[02:41.360 --> 02:45.120]  So back in the past, when they first invented digital calculators,
[02:45.120 --> 02:48.680]  they thought that was machine intelligence or artificial intelligence.
[02:48.680 --> 02:51.400]  And then it came to video game AI.
[02:52.420 --> 02:56.480]  Today we also see things like rules-based artificial intelligence,
[02:56.480 --> 02:58.560]  or classic image recognition.
[02:58.560 --> 03:01.080]  So there's a bunch of different artificial intelligence technologies
[03:01.080 --> 03:03.320]  that aren't necessarily machine learning.
[03:03.560 --> 03:07.960]  Machine learning itself is more about applied statistics,
[03:07.960 --> 03:12.140]  and a merge between what used to be considered business intelligence,
[03:12.140 --> 03:15.120]  and statisticians would do this job,
[03:15.120 --> 03:17.060]  and computer science.
[03:17.060 --> 03:22.160]  With the improvement of computing resources such as memory and CPU,
[03:22.380 --> 03:25.880]  a lot more power can be put behind some of these algorithms.
[03:26.640 --> 03:30.260]  And then deep learning is specifically talking about deep neural networks.
[03:30.260 --> 03:33.680]  Neural networks are a type of machine learning algorithm,
[03:33.680 --> 03:38.140]  and then deep learning allows us to stack layers on top of these.
[03:38.520 --> 03:43.020]  So deep learning usually refers to convolutional neural networks,
[03:43.020 --> 03:45.900]  which are used for image recognition, primarily.
[03:45.900 --> 03:50.240]  And then recurrent neural networks, which are used for cyclical pattern recognition,
[03:50.240 --> 03:51.940]  such as heartbeat monitors.
[03:54.060 --> 03:56.060]  But if we were to define all of this,
[03:56.060 --> 03:59.800]  I would say machine learning is a set of statistical techniques
[03:59.800 --> 04:03.200]  that enables the process of information mining,
[04:03.200 --> 04:05.900]  pattern discovery, and drawing inferences from data.
[04:05.900 --> 04:08.500]  The idea is that machines learn from the data,
[04:08.500 --> 04:12.480]  instead of us having to program signatures, or heuristics,
[04:12.480 --> 04:15.300]  or invent new algorithms for different problems.
[04:15.860 --> 04:18.860]  For example, in deep learning, we have convolutional neural networks
[04:18.860 --> 04:22.140]  where deciding between whether something is a bike or a bus
[04:22.140 --> 04:24.800]  is the exact same algorithm and exact same problem
[04:24.800 --> 04:29.400]  as deciding whether or not something is a tiger or a cheetah.
[04:31.300 --> 04:34.220]  So a couple examples, especially in the security space,
[04:34.220 --> 04:37.920]  we have the ability to classify domain generation algorithms.
[04:37.920 --> 04:41.440]  These are primarily used by botnets as a way for the bot herder
[04:41.440 --> 04:46.540]  to stay in contact with their networks.
[04:47.360 --> 04:50.940]  Web application firewalls, these we will usually see
[04:50.940 --> 04:55.180]  with being able to perform some anomaly detection
[04:55.180 --> 04:58.180]  and see, okay, this is what normal user traffic looks like
[04:58.180 --> 05:00.680]  compared to abnormal user traffic.
[05:00.980 --> 05:03.380]  So in this case, the machine learning algorithm
[05:03.380 --> 05:07.840]  identified a potential SQL injection in the parameter
[05:07.840 --> 05:11.460]  because it's very abnormal from what it's used to seeing.
[05:12.400 --> 05:14.240]  And then, of course, network anomaly detection.
[05:14.240 --> 05:16.640]  This is one of the big ones, especially in the security side,
[05:16.640 --> 05:19.980]  the defender side of things, where we can predict
[05:19.980 --> 05:23.500]  what we think the network traffic will look like over a period of time
[05:23.500 --> 05:25.860]  and then compare that to what the actual is.
[05:25.860 --> 05:29.500]  Once it's beyond a certain delta, then that's something
[05:29.500 --> 05:33.160]  that we should pass off to a human to kind of dig in a little deeper
[05:33.160 --> 05:35.340]  and kind of figure out what's going on.
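The predicted-versus-actual comparison described above can be sketched in a few lines. This is a minimal illustration, not any product's implementation: the function name `flag_anomalies`, the traffic figures, and the delta of 25 are all hypothetical.

```python
def flag_anomalies(predicted, actual, delta):
    """Return the indices where actual traffic differs from the prediction by more than delta."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    return [
        i
        for i, (p, a) in enumerate(zip(predicted, actual))
        if abs(a - p) > delta
    ]


# Hypothetical hourly traffic volumes (for example, megabytes per hour).
predicted = [100, 110, 120, 115, 105]
actual = [98, 112, 190, 117, 30]

# Hours 2 and 4 fall outside the allowed delta and would be handed to a human.
print(flag_anomalies(predicted, actual, delta=25))
```

In practice the prediction would come from a trained model and the delta would be tuned; the sketch only shows the thresholding step.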
[05:36.920 --> 05:38.840]  So why is this important?
[05:39.100 --> 05:41.940]  Well, today we have over a quarter of security products
[05:41.940 --> 05:45.260]  that are used for detection that use some form of machine learning.
[05:45.400 --> 05:49.120]  And we see machine learning being applied to more and more things every day.
[05:49.120 --> 05:53.640]  We see facial recognition, we see advanced spam filtering,
[05:53.640 --> 05:57.360]  data heuristics and persona collection.
[05:58.100 --> 06:02.660]  And so in order for security analysts to properly deploy
[06:02.660 --> 06:05.180]  and manage these machine learning products,
[06:05.180 --> 06:08.400]  they need to have a better understanding of how they operate
[06:08.400 --> 06:11.260]  so that they can ensure that they work efficiently.
[06:15.060 --> 06:18.040]  So the machine learning process, it follows seven steps.
[06:18.240 --> 06:20.820]  This is not a step one, step two, step three,
[06:20.820 --> 06:22.340]  but this is really a cycle.
[06:22.460 --> 06:25.480]  So you start at step one, you go all the way down to step seven,
[06:25.480 --> 06:27.460]  and then you hop back up to step one.
[06:27.520 --> 06:30.360]  So you start off with gathering the data.
[06:30.360 --> 06:32.180]  So first you want to identify if I'm going to build
[06:32.300 --> 06:34.080]  a network anomaly detection system,
[06:34.080 --> 06:36.520]  if I'm going to build some sort of hacking tool.
[06:36.860 --> 06:39.020]  I need to gather data that applies to that,
[06:39.020 --> 06:40.800]  and then I need to prepare the data.
[06:40.800 --> 06:44.200]  So I can't just take raw information,
[06:44.200 --> 06:47.740]  I need to look at things like metadata and pre-process that.
[06:48.220 --> 06:51.220]  This step is also known as pre-processing.
[06:51.840 --> 06:56.080]  We need to choose the model, and the model is going to depend...
[06:56.080 --> 06:57.160]  There are some heuristics.
[06:57.160 --> 07:00.560]  Some models work better for different problem sets,
[07:00.560 --> 07:03.120]  but generally we're going to try a couple different models
[07:03.120 --> 07:05.180]  and see what works best.
[07:05.640 --> 07:09.000]  We're going to train the model, we're going to evaluate the model,
[07:09.000 --> 07:12.120]  and then if we think that the model can perform a little bit better
[07:12.120 --> 07:13.940]  than it's actually performing right now,
[07:13.940 --> 07:16.580]  we can do what's called hyperparameter tuning.
[07:16.680 --> 07:20.100]  And so this lets us make little tweaks here and there.
[07:20.360 --> 07:22.220]  And then of course we deploy it.
[07:22.220 --> 07:26.380]  And as I said earlier, this is a bit of a cyclical process,
[07:26.380 --> 07:28.680]  because machine learning algorithms stop learning
[07:28.680 --> 07:30.820]  as soon as you stop training them.
[07:30.820 --> 07:33.360]  And so you want to continue to gather new data
[07:33.360 --> 07:38.960]  and train new models and build them into newer and better generations.
[07:42.020 --> 07:45.800]  So before we get started, what we're going to do is hop in
[07:45.800 --> 07:48.620]  and learn how to build a basic spam filter.
[07:48.620 --> 07:50.960]  Spam filters are fantastic first examples,
[07:50.960 --> 07:52.340]  especially for a lot of the newbies,
[07:52.340 --> 07:54.420]  because we already know how spam filters work.
[07:54.620 --> 07:58.760]  If it says something that looks spammy, like buy Viagra now,
[07:58.760 --> 08:02.540]  then that's going to be automatically thrown away into our spam filter.
[08:02.720 --> 08:05.720]  If we have something that seems legitimate,
[08:05.720 --> 08:07.660]  then that's going to be passed on to the user.
[08:08.320 --> 08:11.180]  So before we get into that, are there any questions
[08:11.180 --> 08:13.740]  on any of the stuff that I've gone over so far?
[08:22.970 --> 08:23.850]  Okay.
[08:26.810 --> 08:28.030]  One more thing.
[08:28.190 --> 08:32.330]  I totally forgot this, but this is a bit of a story that really...
[08:32.330 --> 08:33.310]  Yes.
[08:36.290 --> 08:37.170]  Okay.
[08:37.250 --> 08:39.870]  This is a story that really made things click for me.
[08:40.490 --> 08:42.930]  So a friend of mine, this was a few years ago,
[08:42.930 --> 08:46.010]  was approached by a non-profit organization.
[08:46.010 --> 08:49.490]  And what this non-profit does is they have cameras set up all over Africa
[08:49.910 --> 08:53.830]  and they take pictures to see if there's a cheetah.
[08:54.170 --> 08:55.950]  And they're tracking cheetah population.
[08:55.990 --> 08:59.310]  So they'll take pictures, it'll get sent off to a human analyst,
[08:59.310 --> 09:00.530]  and the human will look at it and say,
[09:00.530 --> 09:03.170]  okay, is there a cheetah in the picture or not?
[09:03.810 --> 09:06.270]  And so they approached my friend and they said,
[09:06.270 --> 09:08.170]  hey, is there a way that we can improve this process?
[09:08.170 --> 09:09.690]  What's something that we can do here?
[09:09.690 --> 09:13.630]  And so he said, okay, let me try a couple things.
[09:13.910 --> 09:17.990]  So what he did was use a very common image library
[09:17.990 --> 09:22.070]  and said, okay, let's train a machine learning algorithm
[09:22.070 --> 09:26.050]  to identify images, if there's a cheetah in the picture or not.
[09:26.450 --> 09:29.010]  If it reached above a 5% confidence,
[09:29.010 --> 09:31.830]  says, okay, I'm 5% confident or 6% confident
[09:31.830 --> 09:35.550]  that there's a cheetah in this image, it got sent over to an analyst.
[09:35.810 --> 09:39.650]  If it was less than 5% confident, it would just throw it away.
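The triage rule in this story amounts to a simple threshold on the model's confidence. A minimal sketch, assuming the model returns a score between 0 and 1 per image; the file names and scores are made up for illustration.

```python
def triage(scored_images, threshold=0.05):
    """Split images into those sent to an analyst and those discarded."""
    to_analyst, discarded = [], []
    for name, confidence in scored_images:
        if confidence >= threshold:
            to_analyst.append(name)   # 5% confident or more: a human takes a look
        else:
            discarded.append(name)    # below 5%: thrown away
    return to_analyst, discarded


# Hypothetical model output: (image name, confidence that a cheetah is present).
scored = [
    ("img_001.jpg", 0.91),
    ("img_002.jpg", 0.02),
    ("img_003.jpg", 0.06),
    ("img_004.jpg", 0.004),
]

to_analyst, discarded = triage(scored)
print(to_analyst)
print(discarded)
```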
[09:40.630 --> 09:45.210]  Now, 5%, that doesn't really sound like a lot, and it's not.
[09:46.350 --> 09:51.790]  But the ROI of what this model was able to do is pretty astounding.
[09:52.090 --> 09:56.130]  So what normally would have taken them a year of labor to perform,
[09:56.130 --> 09:58.270]  they were able to do in one month.
[09:58.450 --> 10:01.750]  That is a 1,200% increase in productivity.
[10:02.750 --> 10:04.190]  So that's pretty powerful.
[10:04.190 --> 10:05.970]  And that's just a very simple model.
[10:05.970 --> 10:09.550]  So as we can develop more and more advanced things,
[10:09.550 --> 10:11.810]  these are ways that we can really improve our processes
[10:11.810 --> 10:14.630]  to move at machine speed and scale.
[10:16.730 --> 10:17.670]  GT?
[10:17.670 --> 10:18.470]  Yeah.
[10:18.470 --> 10:22.010]  We have a question, and it's,
[10:22.010 --> 10:26.130]  any links to information pertaining to creating your own models
[10:26.130 --> 10:29.250]  rather than training existing ones?
[10:30.670 --> 10:34.590]  Creating your own models rather than training existing ones?
[10:34.890 --> 10:35.310]  Yeah.
[10:35.310 --> 10:38.550]  I think that you're going to be showing them how to create their own model.
[10:38.550 --> 10:40.370]  We are going to create our own model.
[10:41.610 --> 10:44.010]  So there are existing models that you can use,
[10:44.010 --> 10:45.610]  what's called transfer learning.
[10:45.610 --> 10:50.170]  So you take a problem, for example, ImageNet, which is a challenge.
[10:50.170 --> 10:53.190]  It can recognize hundreds of different categories of images,
[10:53.190 --> 10:55.650]  and you can take that idea, train your own model
[10:55.650 --> 10:59.630]  so that it can recognize whether or not there's a cheetah in an image.
[11:00.270 --> 11:03.030]  But what we're going to do is actually build a model from scratch.
[11:03.030 --> 11:06.870]  There's no previously trained information,
[11:06.870 --> 11:08.710]  and I'm going to walk you through how to do that,
[11:08.710 --> 11:10.630]  and we're going to see a workbook in a little bit.
[11:11.650 --> 11:12.330]  Thank you.
[11:12.330 --> 11:14.050]  Hopefully that helps answer the question.
[11:16.750 --> 11:17.670]  Okay.
[11:18.170 --> 11:21.910]  So before we get started, machine learning, it uses statistics,
[11:21.910 --> 11:26.790]  and so what we need to do is grab information and turn them into numbers.
[11:27.090 --> 11:30.350]  But if we wanted to have the idea of a spam email
[11:30.350 --> 11:33.490]  versus a regular email or a ham email,
[11:34.110 --> 11:38.410]  we need to kind of understand how we're going to interpret words into numbers.
[11:39.270 --> 11:44.470]  So here's a couple sentences that refer to sports and not sports,
[11:44.470 --> 11:46.690]  or realistically, elections.
[11:46.690 --> 11:51.690]  So we see a great game, and that talks about sports.
[11:51.950 --> 11:55.150]  And then we see the election was over, and that talks about not sports.
[11:55.790 --> 11:59.470]  So the idea is how do we take a sentence that we haven't seen before,
[11:59.470 --> 12:01.130]  because that's what machine learning is supposed to do.
[12:01.130 --> 12:03.470]  We were trained on certain information,
[12:03.470 --> 12:06.690]  but we want to predict on information that we haven't seen before.
[12:07.010 --> 12:10.170]  How do we take a new sentence and then try and make a decision
[12:10.170 --> 12:13.230]  whether or not it's talking about sports or not sports?
[12:14.950 --> 12:18.350]  So to do that, we're going to use what's called Bayes' theorem.
[12:18.390 --> 12:23.250]  Bayes' theorem is a very common probability theorem that basically says,
[12:23.250 --> 12:28.230]  okay, so if we have a sentence that we already know,
[12:28.230 --> 12:34.890]  and given the probability that this sentence is talking about sports,
[12:34.890 --> 12:38.390]  versus here's a new sentence,
[12:38.390 --> 12:42.130]  what is the probability that it's talking about sports given that new sentence?
[12:42.590 --> 12:47.990]  So the probability of A given B equals probability of B given A, blah, blah, blah.
[12:47.990 --> 12:50.650]  We don't need to know too much about this,
[12:50.650 --> 12:54.250]  but it's important to get a good foundation of this understanding.
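Written out, the theorem referred to here is the standard identity below. The second line substitutes the sports example, with "sports" standing in for A and "sentence" for B.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

P(\text{sports} \mid \text{sentence}) = \frac{P(\text{sentence} \mid \text{sports})\, P(\text{sports})}{P(\text{sentence})}
```

Because P(sentence) is the same for every class, comparing classes only requires the numerators.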
[12:54.250 --> 12:57.070]  So the idea is, what is the probability that the sentence,
[12:57.130 --> 13:00.230]  a very close game, is talking about sports?
[13:01.050 --> 13:02.430]  We don't know the answer to that.
[13:02.430 --> 13:05.850]  We've never come across the sentence of a very close game,
[13:05.850 --> 13:10.090]  but we would need to fill it into this algorithm here.
[13:10.930 --> 13:14.570]  So instead, what we can do is actually break up a very close game
[13:14.570 --> 13:19.050]  into their individual words, a, very, close, and game.
[13:19.110 --> 13:21.510]  And then we can turn that probability into,
[13:21.510 --> 13:26.510]  what's the probability that A is talking about sports?
[13:26.510 --> 13:28.990]  What's the probability that very is talking about sports?
[13:28.990 --> 13:31.310]  What's the probability of close talking about sports,
[13:31.310 --> 13:33.810]  and game talking about sports?
[13:35.970 --> 13:39.030]  So this winds up becoming a bit of a counting game.
[13:39.230 --> 13:42.630]  How many times do we see A, and it's talking about sports?
[13:42.630 --> 13:43.550]  Twice.
[13:43.910 --> 13:46.250]  And of course we need to do the same thing for not sports,
[13:46.250 --> 13:49.130]  because we need to find the probability of A happening,
[13:49.130 --> 13:50.970]  the probability of B happening,
[13:50.970 --> 13:52.790]  and we take whichever is higher.
[13:52.910 --> 13:55.310]  So the probability that A is talking about sports,
[13:55.310 --> 13:59.050]  we see it twice in two of our sentences,
[13:59.050 --> 14:02.250]  or twice across all of our sports sentences.
[14:02.250 --> 14:03.470]  Got to be precise.
[14:03.770 --> 14:06.870]  We don't see A in not sports at all,
[14:06.870 --> 14:10.110]  so that's going to be zero across all of our not sports sentences.
[14:10.870 --> 14:15.150]  Very shows up in sports once, does not show up in not sports.
[14:15.150 --> 14:19.550]  Close shows up in not sports once,
[14:19.550 --> 14:21.350]  does not show up in sports,
[14:21.350 --> 14:23.830]  and then game shows up in sports twice,
[14:23.830 --> 14:25.810]  and it doesn't show up in not sports.
[14:27.110 --> 14:29.170]  So we take those numbers and we plug them in.
[14:29.170 --> 14:31.670]  We saw for sports, A showed up twice,
[14:31.670 --> 14:32.790]  Very showed up once,
[14:32.790 --> 14:34.830]  Close didn't show up at all,
[14:34.830 --> 14:36.350]  and Game showed up twice.
[14:36.450 --> 14:38.650]  And then if we count the number of words that we have
[14:38.650 --> 14:40.570]  across all of the sports sentences,
[14:40.570 --> 14:41.890]  we have a total of 11.
[14:41.890 --> 14:44.150]  So it just becomes a simple 2 divided by 11,
[14:44.150 --> 14:46.870]  1 divided by 11, and so on.
[14:47.550 --> 14:50.370]  But we can't use straight naive Bayes.
[14:50.430 --> 14:53.530]  And the reason why is because we have these zeros.
[14:53.910 --> 14:57.250]  Now, close is a word that we've never seen in sports,
[14:57.250 --> 14:59.630]  so we get zero, and that winds up being
[14:59.630 --> 15:01.350]  zero divided by 11, which is zero.
[15:01.350 --> 15:03.310]  Zero times everything is zero,
[15:03.310 --> 15:05.650]  and so unfortunately the probability of
[15:05.650 --> 15:08.890]  this sentence being about sports,
[15:08.890 --> 15:11.190]  if we just use straight naive Bayes,
[15:11.770 --> 15:12.950]  is zero.
[15:13.530 --> 15:15.550]  So that doesn't really work for us.
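The zero problem is easy to reproduce with the sports counts quoted above (a: 2, very: 1, close: 0, game: 2, out of 11 words). A minimal sketch; the function name `unsmoothed_likelihood` is hypothetical.

```python
from math import prod

# Word counts across the sports sentences, as stated in the talk.
sports_counts = {"a": 2, "very": 1, "close": 0, "game": 2}
total_sports_words = 11


def unsmoothed_likelihood(sentence, counts, total_words):
    """Multiply the per-word probabilities count / total_words, with no smoothing."""
    return prod(
        counts.get(word, 0) / total_words
        for word in sentence.lower().split()
    )


# "close" never appears in sports, so a single zero wipes out the whole product.
print(unsmoothed_likelihood("a very close game", sports_counts, total_sports_words))  # 0.0
```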
[15:16.590 --> 15:18.330]  So instead, what we're going to use is called
[15:18.330 --> 15:20.230]  multinomial naive Bayes.
[15:20.230 --> 15:22.130]  It's a very similar formula, but we have
[15:22.130 --> 15:23.710]  what's called a smoothing filter,
[15:23.710 --> 15:25.050]  and that's this alpha.
[15:25.050 --> 15:27.070]  We can set this alpha to whatever we want.
[15:27.070 --> 15:28.650]  Usually we just set it to 1.
[15:29.830 --> 15:32.070]  But we can modify that, and this will be
[15:33.130 --> 15:34.830]  a hyperparameter that we'll take a look at
[15:34.830 --> 15:36.430]  later in our workbook.
[15:38.590 --> 15:40.550]  So with the smoothing filter,
[15:40.550 --> 15:42.250]  we do the exact same word counting.
[15:42.250 --> 15:43.750]  We see A showed up twice in sports.
[15:43.750 --> 15:45.070]  Very showed up once in sports.
[15:45.070 --> 15:46.210]  Close, zero.
[15:46.650 --> 15:48.610]  Game showed up twice in sports.
[15:48.650 --> 15:50.590]  And then we do the same thing for not sports.
[15:50.810 --> 15:53.530]  But you see here where it says close for sports?
[15:53.830 --> 15:55.590]  That smoothing filter makes it so that
[15:55.590 --> 15:58.710]  that number never becomes zero.
[15:58.810 --> 16:00.810]  It's always greater than zero.
[16:00.810 --> 16:03.090]  So our probabilities will always fit.
[16:03.530 --> 16:05.090]  And then this is just simple addition
[16:05.090 --> 16:06.450]  and then division.
[16:07.450 --> 16:09.490]  And if we took out a calculator,
[16:09.490 --> 16:11.030]  did all our math, we can see that
[16:11.650 --> 16:16.410]  "a very close game" has a probability of 0.0000461
[16:16.410 --> 16:18.810]  that it's talking about sports.
[16:18.810 --> 16:20.770]  And then "a very close game" has a probability of
[16:22.530 --> 16:25.790]  0.0000143 that it's talking about not sports.
[16:25.970 --> 16:27.250]  So we take whichever is higher,
[16:27.250 --> 16:29.270]  in this case sports, and so our classifier
[16:30.790 --> 16:33.510]  has decided that "a very close game"
[16:33.510 --> 16:35.970]  is talking about sports.
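The smoothed calculation can be sketched as follows. The sports counts and the total of 11 come from the talk. The vocabulary size of 14, the not-sports total of 9, and a single occurrence of "a" in not sports are assumptions, chosen because they reproduce the two figures quoted above; the slides themselves are not in the transcript. Like those figures, the sketch multiplies only the per-word terms and leaves out the class prior.

```python
from math import prod


def smoothed_likelihood(sentence, counts, total_words, vocab_size, alpha=1.0):
    """Per-word product using (count + alpha) / (total_words + alpha * vocab_size)."""
    return prod(
        (counts.get(word, 0) + alpha) / (total_words + alpha * vocab_size)
        for word in sentence.lower().split()
    )


VOCAB_SIZE = 14  # assumption: number of unique words across all training sentences

classes = {
    # Counts and total as stated in the talk.
    "sports": {"counts": {"a": 2, "very": 1, "close": 0, "game": 2}, "total": 11},
    # Assumption: total of 9 words, and "a" seen once in not sports.
    "not sports": {"counts": {"a": 1, "very": 0, "close": 1, "game": 0}, "total": 9},
}

scores = {
    label: smoothed_likelihood("a very close game", c["counts"], c["total"], VOCAB_SIZE)
    for label, c in classes.items()
}

for label, score in scores.items():
    print(f"{label}: {score:.7f}")

# Pick whichever class scores higher.
prediction = max(scores, key=scores.get)
print(prediction)
```

With alpha set to 1, no word can contribute a zero, so both classes get a small but non-zero score and the comparison is meaningful.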
[16:39.430 --> 16:41.670]  So I know that was a lot of information.
[16:41.930 --> 16:43.690]  Let me know if there's any questions.
[16:44.090 --> 16:47.290]  I'm going to kind of hop forward a little bit,
[16:47.290 --> 16:49.450]  but TAs, feel free to chime in
[16:49.450 --> 16:51.130]  if there's any questions on this.
[16:51.170 --> 16:53.990]  I know this bit's a little confusing sometimes,
[16:53.990 --> 16:56.110]  but it's not something we need to harp on too much.
[16:56.110 --> 16:57.750]  It's just an underlying foundation
[16:57.750 --> 16:58.930]  that's good to know.
[17:01.450 --> 17:03.430]  So the five things that we need to keep track of
[17:03.430 --> 17:06.450]  in order to do our pen-and-paper math
[17:06.450 --> 17:08.330]  that we did earlier, is we need to keep track
[17:08.330 --> 17:10.590]  of the total number of unique words,
[17:10.590 --> 17:12.910]  the total number of words in spam,
[17:12.910 --> 17:15.410]  the total number of words in ham,
[17:15.410 --> 17:19.870]  and these two were what we've put in the denominator.
[17:20.450 --> 17:22.530]  The count of each word in spam
[17:22.530 --> 17:24.490]  and the count of each word in ham.
[17:24.490 --> 17:27.230]  So this is where we had to count the word A,
[17:27.230 --> 17:30.310]  count the word very, count close, count game.
[17:31.290 --> 17:32.550]  And so these are the five things
[17:32.550 --> 17:34.090]  that we need to pay attention to.
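The five quantities listed above can be collected with a couple of counters. A minimal sketch, assuming each email has already been reduced to a list of words; the function name `training_stats` and the tiny example emails are hypothetical.

```python
from collections import Counter


def training_stats(spam_docs, ham_docs):
    """Collect the five quantities needed for the word-count math."""
    spam_counts = Counter(word for doc in spam_docs for word in doc)
    ham_counts = Counter(word for doc in ham_docs for word in doc)
    return {
        "unique_words": len(set(spam_counts) | set(ham_counts)),  # 1. total unique words
        "total_spam_words": sum(spam_counts.values()),            # 2. total words in spam
        "total_ham_words": sum(ham_counts.values()),              # 3. total words in ham
        "spam_word_counts": spam_counts,                          # 4. count of each word in spam
        "ham_word_counts": ham_counts,                            # 5. count of each word in ham
    }


# Hypothetical, already-tokenized emails.
spam_docs = [["buy", "viagra", "now"], ["buy", "now"]]
ham_docs = [["thank", "support"], ["install", "font", "thank"]]

stats = training_stats(spam_docs, ham_docs)
print(stats["unique_words"], stats["total_spam_words"], stats["total_ham_words"])
print(stats["spam_word_counts"]["buy"], stats["ham_word_counts"]["thank"])
```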
[17:35.290 --> 17:37.610]  Are there any questions so far?
[17:41.050 --> 17:43.730]  It looks like we're clear on questions right now.
[17:43.770 --> 17:44.650]  Awesome.
[17:45.770 --> 17:47.970]  Okay, so how do we take this idea
[17:47.970 --> 17:50.490]  and expand it to a full email?
[17:51.810 --> 17:54.210]  Well, here's a spam email.
[17:55.110 --> 17:56.910]  Yeah, no, this is not a spam email.
[17:56.910 --> 17:58.550]  This is a regular email.
[17:58.550 --> 18:00.550]  And we're going to use this as a common example
[18:00.550 --> 18:02.310]  throughout the rest of our workbook.
[18:02.650 --> 18:04.330]  So in this example email,
[18:04.330 --> 18:07.390]  we have "Re: Re: East Asian fonts in Lenny."
[18:07.390 --> 18:08.410]  Thanks for your support.
[18:08.410 --> 18:10.350]  Installing Unifonts did it well for me.
[18:11.870 --> 18:14.530]  Now, we could take this entire email
[18:14.530 --> 18:16.530]  and look at it,
[18:16.530 --> 18:19.030]  but there are some things that we want to
[18:19.650 --> 18:20.790]  probably remove,
[18:20.790 --> 18:22.430]  things that don't really make a lot of sense,
[18:22.430 --> 18:24.210]  especially in the English language.
[18:24.350 --> 18:26.950]  So punctuation, punctuation doesn't really make sense.
[18:26.950 --> 18:29.610]  It doesn't add any context to a sentence.
[18:30.950 --> 18:34.110]  Small words, which are called stop words,
[18:34.110 --> 18:37.510]  such as in, of, on, for, the,
[18:37.510 --> 18:39.870]  they also don't add any context.
[18:39.870 --> 18:41.610]  So we can just go ahead and remove that,
[18:41.610 --> 18:43.850]  and that'll reduce the amount of words
[18:43.850 --> 18:45.330]  that we're looking at in this email,
[18:45.330 --> 18:46.970]  because we want to look at keywords.
[18:48.810 --> 18:51.990]  Things like unsubscribe and unsubscribed,
[18:51.990 --> 18:54.990]  or thanks, or thank you.
[18:54.990 --> 18:57.310]  They mean the same thing.
[18:57.390 --> 18:59.590]  A really good example, especially in English,
[18:59.590 --> 19:02.090]  is congrats and congratulations.
[19:02.470 --> 19:04.050]  They mean the same thing.
[19:04.050 --> 19:06.370]  And if we want to narrow this down to keywords,
[19:06.370 --> 19:08.350]  we should do what's called stemming.
[19:08.350 --> 19:09.530]  And so what stemming does
[19:09.530 --> 19:11.350]  is it shortens the word down
[19:11.350 --> 19:14.270]  so that you have I thanked somebody today,
[19:14.270 --> 19:16.410]  as in past tense, thank-ed,
[19:17.490 --> 19:20.830]  thanks, thank with an S,
[19:20.830 --> 19:23.290]  and thank you, where it's just plain thank.
[19:23.290 --> 19:25.710]  We want to count those all as the same word.
[19:27.630 --> 19:28.450]  And then, of course,
[19:28.450 --> 19:30.610]  we want to set everything to lowercase,
[19:30.610 --> 19:34.510]  because capital words such as unsubscribe,
[19:34.510 --> 19:38.450]  all caps, is the exact same as unsubscribe, lowercase.
[19:38.710 --> 19:41.370]  So this is our preprocessing step.
[19:41.370 --> 19:43.370]  And what we did was we removed all the stop words.
[19:43.370 --> 19:47.070]  We removed re, in, for, your, did, it, for.
[19:48.130 --> 19:50.350]  And then we're going to stem the words.
[19:50.350 --> 19:52.570]  I didn't show the stemming in here,
[19:52.570 --> 19:54.590]  and we'll see in the workbook as we go along
[19:54.590 --> 19:56.490]  what the stemming actually looks like.
[19:56.770 --> 20:00.130]  But we can see that we kind of shrunk the email a little bit,
[20:00.130 --> 20:01.910]  and this is going to be a little easier
[20:01.910 --> 20:05.270]  for us to compare emails to each other.
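The preprocessing steps described above (lowercasing, stripping punctuation, dropping stop words, stemming) can be sketched with the standard library alone. This is a stand-in for illustration, not the workbook's code: the stop-word list is a tiny hand-picked set and `crude_stem` is a deliberately simplistic suffix stripper. The workshop uses NLTK for both jobs, and a real stemmer handles many more cases. The punctuation in the example email is assumed.

```python
import string

# Tiny illustrative stop-word list; a real list is much longer.
STOP_WORDS = {"re", "in", "for", "your", "did", "it", "of", "on", "the"}


def crude_stem(word):
    """Strip a few common suffixes so 'thanked', 'thanks' and 'thank' collapse together."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, remove punctuation, drop stop words, then stem what is left."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [crude_stem(token) for token in no_punct.split() if token not in STOP_WORDS]


email = (
    "Re: Re: East Asian fonts in Lenny. Thanks for your support. "
    "Installing Unifonts did it well for me."
)
print(preprocess(email))
```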
[20:06.430 --> 20:09.850]  So I think this is the point
[20:09.850 --> 20:12.730]  where we can go ahead and open up our workbook.
[20:13.430 --> 20:17.110]  So I'm going to escape out of there, open this up.
[20:17.410 --> 20:18.670]  So for our workshops,
[20:18.670 --> 20:19.790]  we're going to do this for this workshop
[20:19.790 --> 20:20.870]  and for the other workshops.
[20:20.870 --> 20:22.490]  We're going to use a system called MyBinder.
[20:22.490 --> 20:24.310]  MyBinder is fantastic.
[20:24.310 --> 20:27.870]  It uses Jupyter notebooks in a Docker container.
[20:27.870 --> 20:31.570]  So I'm going to go ahead and copy this
[20:31.570 --> 20:36.590]  and send it off into Discord somehow.
[20:37.810 --> 20:41.790]  Actually, is there a way...
[20:41.790 --> 20:44.870]  I'm going to kind of hide my screen for a second.
[20:45.730 --> 20:47.210]  Yeah, just post it to me,
[20:47.210 --> 20:50.530]  and I'll put it in all the places for you.
[20:51.950 --> 20:53.590]  Yeah, absolutely.
[20:55.690 --> 20:56.570]  Okay.
[20:56.830 --> 20:58.290]  All right, perfect.
[20:58.290 --> 20:59.410]  Got it.
[21:00.150 --> 21:01.450]  Display capture.
[21:01.490 --> 21:03.870]  I'm still getting used to OBS and Twitch streaming.
[21:03.870 --> 21:06.590]  I haven't done this before, but I'm really excited.
[21:07.990 --> 21:09.610]  So MyBinder, once you get the link,
[21:09.610 --> 21:12.430]  you can just grab that link, paste it in here,
[21:12.430 --> 21:14.270]  or you can click launch,
[21:14.270 --> 21:16.810]  and you can do this with any GitHub repository.
[21:17.210 --> 21:20.850]  So this is going to launch some Jupyter notebooks.
[21:20.850 --> 21:25.150]  It takes about a minute or two to spin up.
[21:25.430 --> 21:28.870]  So I'm going to let you guys get that link from the TA,
[21:28.870 --> 21:30.450]  from the workshops channel,
[21:30.450 --> 21:34.030]  and then I'll let this spin up for me.
[21:34.030 --> 21:36.370]  I'll wait a little bit longer so that it spins up for you,
[21:36.370 --> 21:38.710]  and then we can get moving.
[21:40.130 --> 21:42.570]  While this is happening, are there any questions?
[21:42.570 --> 21:44.310]  Feel free to just write them in the channel,
[21:44.310 --> 21:46.270]  and the TAs will relay them.
[22:02.570 --> 22:05.910]  And once yours spins up the way that I have this laid out,
[22:05.910 --> 22:08.810]  go into the workbooks directory.
[22:09.070 --> 22:12.930]  If you would like to follow along in the live coding session,
[22:12.930 --> 22:18.550]  we're going to use the workbook spam filter sklearn document.
[22:18.550 --> 22:21.630]  If you are not that familiar with Python,
[22:21.630 --> 22:23.570]  and you just want to follow along,
[22:23.570 --> 22:26.310]  you can go ahead and use the completed workbook.
[22:27.870 --> 22:32.830]  I would give it just another couple of minutes to make sure theirs are spun up.
[22:32.830 --> 22:34.690]  Yes, absolutely. Absolutely.
[22:36.050 --> 22:39.170]  So I just want to check something real quick on my other screen.
[23:06.800 --> 23:09.140]  And these workbooks outside of this workshop,
[23:09.140 --> 23:11.480]  you can play with them on your own.
[23:11.680 --> 23:15.300]  I have a couple others if you want to take a look.
[23:16.160 --> 23:18.120]  But yeah, for the purpose of this workshop,
[23:18.120 --> 23:20.660]  we're going to be using the spam filter sklearn,
[23:20.660 --> 23:23.600]  or the completed spam filter sklearn.
[23:32.700 --> 23:33.620]  Okay.
[23:35.280 --> 23:38.120]  So I'm going to go ahead and start.
[23:39.020 --> 23:43.660]  The first couple things that I'm going to go over are just next, next, next run.
[23:43.880 --> 23:47.560]  If you're still waiting for your instance to spin up, don't feel bad.
[23:47.560 --> 23:49.400]  You'll be able to catch up very quickly.
[23:51.080 --> 23:54.600]  So I'm going to go ahead and open up my spam filter sklearn.
[23:54.600 --> 23:57.500]  And in this one, we are going to use scikit-learn,
[23:57.500 --> 24:00.100]  which is a wonderful abstraction library.
[24:00.240 --> 24:01.660]  Great for beginners.
[24:03.500 --> 24:07.740]  So we're going to go ahead and install a couple tools.
[24:07.900 --> 24:11.580]  So for these blocks, in order to run them,
[24:11.580 --> 24:15.300]  you can click on them and press shift enter,
[24:15.300 --> 24:17.980]  or you can press the run button up here.
[24:17.980 --> 24:20.260]  You can see that it's running all the Python code.
[24:20.260 --> 24:22.080]  In this one, I'm just installing the libraries
[24:22.080 --> 24:25.140]  and I am configuring the data directory
[24:25.880 --> 24:30.140]  where all of our spam and our ham emails will be located.
[24:31.680 --> 24:36.900]  So that's going to take a little bit while that is going.
[24:37.820 --> 24:41.380]  Just kind of a quick overview of this first block.
[24:41.380 --> 24:42.700]  We have our imports.
[24:42.700 --> 24:44.580]  So in this case, we're going to use matplotlib.
[24:45.180 --> 24:48.820]  Matplotlib is going to let us print some nice graphs.
[24:48.820 --> 24:50.360]  NumPy and Pandas are going to let us
[24:50.360 --> 24:53.020]  grab some of the information that we need.
[24:53.380 --> 24:56.280]  We're going to use a couple of the others
[24:57.520 --> 24:59.660]  as some support libraries.
[24:59.660 --> 25:01.560]  For example, OS, we're going to use that later
[25:01.560 --> 25:04.580]  to kind of grab all of the files.
[25:05.420 --> 25:07.480]  We want to use NLTK.
[25:07.480 --> 25:10.580]  NLTK stands for Natural Language Toolkit.
[25:10.820 --> 25:14.220]  And so this will allow us to identify our stop words
[25:14.220 --> 25:16.960]  and perform our stemming functions.
[25:18.100 --> 25:21.820]  From scikit-learn, we're going to use a train-test split.
[25:21.820 --> 25:25.440]  The way that you typically want to train a machine learning model
[25:25.440 --> 25:28.580]  is you have a bunch of clean labeled data
[25:29.340 --> 25:33.280]  and you're going to take probably about 80 or 70% of that
[25:33.280 --> 25:35.840]  and use it to train the model.
[25:35.840 --> 25:37.700]  But then you want to see how well the model is doing.
[25:37.700 --> 25:41.260]  So you're going to use the other 20 or 30% as your testing
[25:41.260 --> 25:43.060]  and you're going to use this as evaluation.
[25:43.440 --> 25:45.960]  It's kind of the same way we teach students in elementary school.
[25:45.960 --> 25:48.520]  Here's the bulk of the information and now we're going to test you
[25:48.520 --> 25:50.840]  to see how well you understand it.
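A minimal sketch of the train/test idea described above, using scikit-learn's train_test_split. The toy emails and labels are made up for illustration; in the workshop the data is already split on disk, so this only shows the concept.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 labeled "emails" (not from the workshop data set).
emails = [f"email body {i}" for i in range(10)]
labels = ["ham"] * 7 + ["spam"] * 3

# Hold back 20% for evaluation; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.2, random_state=42
)

print(len(X_train), "training emails,", len(X_test), "testing emails")
```

The model only ever sees X_train and y_train while it learns; X_test and y_test are kept aside for grading it afterwards.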
[25:52.620 --> 25:55.880]  And then our TF-IDF and count vectorizers,
[25:55.880 --> 25:57.840]  we will get into that in a bit.
[25:57.840 --> 26:01.220]  These are going to be used for transforming our words
[26:01.220 --> 26:03.880]  into numbers that we can work with.
[26:04.960 --> 26:07.280]  And then the two models that we're actually going to build out
[26:07.280 --> 26:10.280]  are going to be the multinomial Naive Bayes,
[26:10.280 --> 26:11.640]  which we went over in the slides,
[26:11.640 --> 26:15.980]  and then one that is completely unfamiliar, logistic regression.
[26:16.020 --> 26:21.240]  And this is to show how incredible machine learning can be
[26:21.240 --> 26:23.720]  where you can just swap out algorithms.
[26:23.780 --> 26:28.680]  So we can use SVM, random forest, neural networks,
[26:28.680 --> 26:31.460]  things of that nature without massively changing the way
[26:31.460 --> 26:33.180]  that our data is represented.
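The point about swapping algorithms can be sketched as follows. This is an illustration on a made-up four-email corpus, not the workshop's code: both scikit-learn estimators expose the same fit and predict methods, so the vectorized data is reused unchanged.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, only to show the shared fit/predict interface.
corpus = ["buy cheap pills now", "meeting notes attached",
          "cheap pills cheap prices", "lunch meeting tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# The data representation stays the same; only the estimator changes.
predictions = {}
for model in (MultinomialNB(), LogisticRegression()):
    model.fit(X, labels)
    new_X = vectorizer.transform(["cheap pills"])
    predictions[type(model).__name__] = model.predict(new_X)[0]

print(predictions)
```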
[26:33.980 --> 26:38.660]  And then to kind of represent how our models are doing,
[26:38.660 --> 26:41.980]  we're going to use Seaborn as a graphic representation,
[26:41.980 --> 26:43.160]  and we're going to use a confusion matrix
[26:43.590 --> 26:46.670]  and a classification report to identify exactly
[26:47.200 --> 26:49.880]  where our model is doing well, where it's not doing well,
[26:49.880 --> 26:53.920]  and how mild changes are going to affect it.
[26:54.580 --> 26:57.540]  So I'm going to go ahead, shift-enter, run that.
[26:58.160 --> 26:59.940]  And that's just going to do the import,
[26:59.940 --> 27:02.040]  and we'll see libraries imported.
[27:02.440 --> 27:04.660]  And then here's our test email from the slides.
[27:04.660 --> 27:07.100]  This is the East Asian fonts in Lenny.
[27:07.100 --> 27:10.560]  We're going to see every step of the way how this email is manipulated
[27:10.560 --> 27:14.440]  and transformed into workable data for our machine learning models.
[27:14.660 --> 27:18.080]  So I'm going to hit shift-enter and run that.
[27:20.840 --> 27:22.840]  TAs, are there any questions so far?
[27:22.840 --> 27:25.740]  Is everybody able to get their stuff set up?
[27:26.660 --> 27:29.660]  No questions so far. Looks like everything is going well.
[27:29.660 --> 27:30.540]  Perfect.
[27:31.720 --> 27:35.340]  So the tokenizer, this is the first step where we want to define
[27:36.000 --> 27:39.260]  a function that will pre-process our information.
[27:39.740 --> 27:42.920]  So, as we saw in the slides, we want to make everything lowercase,
[27:42.920 --> 27:46.300]  we want to remove the stop words, and we want to stem the words.
[27:46.880 --> 27:49.180]  So this function is already defined for you,
[27:49.180 --> 27:51.660]  but we can see that we grab all the punctuation,
[27:51.660 --> 27:54.220]  we grab all the stop words using NLTK,
[27:54.220 --> 27:59.280]  and then we define the porter stemmer, which is going to stem our words.
[27:59.780 --> 28:02.160]  And then we're going to create a list of tokens
[28:02.160 --> 28:06.920]  by taking all of our words, setting them to lowercase,
[28:06.920 --> 28:10.460]  and using the word tokenizer,
[28:10.460 --> 28:14.340]  which will just take a sentence, split it up by its individual words,
[28:14.340 --> 28:16.340]  we're going to strip out the punctuations,
[28:16.340 --> 28:19.700]  and then we are going to merge everything together.
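A rough sketch of a tokenizer along the lines just described: lowercase, split into words, strip punctuation, drop stop words, stem. The function name simple_tokenizer and the short stop-word set are assumptions made so the sketch runs without downloading NLTK data; the workbook's own function uses NLTK's stop-word list and word tokenizer and may differ in detail.

```python
import string

from nltk.stem import PorterStemmer

# Assumption: a tiny hand-written stop-word list so the sketch runs without
# downloading NLTK corpora; the workshop pulls the full list from NLTK.
STOP_WORDS = {"a", "an", "the", "for", "in", "is", "to", "of", "and", "i"}

stemmer = PorterStemmer()


def simple_tokenizer(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words, stem."""
    tokens = []
    for word in text.lower().split():
        word = word.strip(string.punctuation)  # trims the edges only
        if word and word not in STOP_WORDS:
            tokens.append(stemmer.stem(word))
    return tokens


print(simple_tokenizer("Thanks for installing the fonts!"))
```

Because the Porter stemmer maps "installing", "installed" and "install" to the same stem, all three end up counted as one token.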
[28:20.480 --> 28:22.360]  So we're going to go ahead and run that.
[28:22.360 --> 28:23.920]  It's going to define our function,
[28:23.920 --> 28:28.360]  and we're going to hop into our first task.
[28:28.560 --> 28:30.180]  So our first task is really easy.
[28:30.180 --> 28:33.800]  It's just print the full test email,
[28:33.800 --> 28:37.680]  and then we're going to print the tokenized version of the test email.
[28:37.820 --> 28:43.220]  So in Python we can just do print test email,
[28:43.220 --> 28:49.880]  and then we can do print tokenizer test email.
[28:50.160 --> 28:53.040]  And then shift-enter, going to go ahead and run that.
[28:53.040 --> 28:56.220]  So we see the original email, East Asian fonts in Lenny.
[28:56.380 --> 28:58.340]  Then we can see the stop words have been removed.
[28:58.340 --> 29:01.660]  So we're looking at East Asian font Lenny.
[29:01.800 --> 29:04.800]  Thanks has been turned into thank.
[29:05.240 --> 29:09.980]  Support, instal. Install is missing an L because we see installing.
[29:09.980 --> 29:14.440]  You could have I installed something, or I am going to install something.
[29:14.440 --> 29:18.460]  And so we're going to stem that so that all of those words are counted the same.
[29:20.060 --> 29:24.540]  We see unsubscribe. Unsubscribe shows up twice.
[29:24.540 --> 29:30.700]  And then we see lists.debian.org, which also shows up twice.
[29:30.700 --> 29:34.960]  So this is the way that our tokenizer modifies the email as we're going through.
[29:36.420 --> 29:39.280]  And again, just for those that are trying to catch up,
[29:39.280 --> 29:44.840]  all we did was print the test email and then print the tokenizer version of the test email.
[29:46.040 --> 29:50.520]  There's a question from Twitch chat.
[29:55.220 --> 30:01.520]  So why are we assuming the punctuation is not relevant?
[30:01.800 --> 30:05.180]  In our case, we're looking at spam emails.
[30:05.180 --> 30:09.320]  And so the punctuation, you'll look at commas, hyphens, question marks.
[30:09.320 --> 30:12.840]  And so in the context, if we're looking at keywords,
[30:13.560 --> 30:15.660]  the punctuation doesn't really matter.
[30:15.660 --> 30:18.200]  So if you see buy Viagra now, period,
[30:18.200 --> 30:21.940]  it doesn't make any difference between buy Viagra now, exclamation point,
[30:21.940 --> 30:25.260]  or buy Viagra now with a question mark at the end.
[30:25.780 --> 30:28.300]  So in the context of the sentence,
[30:31.940 --> 30:34.940]  that one sentence would likely be classified as spam
[30:34.940 --> 30:38.440]  because it's talking about Viagra and it's very unlikely.
[30:38.440 --> 30:41.480]  That's a very common spam word we'll see in a lot of spam messages.
[30:41.480 --> 30:44.280]  So punctuation is just not relevant
[30:44.280 --> 30:47.480]  because it doesn't add any context to the message.
[30:48.720 --> 30:50.560]  And so that's something that we can remove.
[30:50.820 --> 30:54.100]  If you would like, you can follow through on this workbook
[30:54.100 --> 30:56.400]  and then try it without removing the punctuation
[30:56.400 --> 30:58.740]  and see if you notice any differences.
[30:59.960 --> 31:02.220]  But I don't think there would be any differences
[31:02.220 --> 31:04.460]  just because of that context.
[31:04.640 --> 31:09.320]  There are better things to weigh than periods, question marks, commas,
[31:09.320 --> 31:10.480]  things of that nature.
[31:10.480 --> 31:14.360]  And then as you see, emails or websites,
[31:14.360 --> 31:17.220]  such as lists.debian.org, they still have their punctuation
[31:17.220 --> 31:21.080]  because that's not the word lists, debian, and org.
[31:21.080 --> 31:24.460]  That is a URL, lists.debian.org.
[31:24.460 --> 31:26.360]  And so that's kind of the difference there.
[31:26.360 --> 31:30.120]  If we had hyphenated words, we would see them included in here as well.
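One way the behaviour in this answer could come about, shown as a sketch rather than the workbook's actual implementation: if punctuation is only trimmed from the edges of each token, stand-alone marks disappear while periods inside a URL and hyphens inside a word survive. The helper name strip_punctuation and the token list are hypothetical.

```python
import string


def strip_punctuation(tokens):
    """Trim punctuation from token edges and drop tokens that become empty."""
    cleaned = (tok.strip(string.punctuation) for tok in tokens)
    return [tok for tok in cleaned if tok]


# Hypothetical tokens, as a word tokenizer might emit them.
tokens = ["buy", "viagra", "now", "!", "see", "lists.debian.org", ",", "e-mail"]
print(strip_punctuation(tokens))
```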
[31:33.660 --> 31:36.920]  So we're going to go on and load the training data.
[31:36.920 --> 31:42.820]  This one is just going to give us the counts that we need.
[31:42.820 --> 31:44.500]  This is just done for you.
[31:44.500 --> 31:46.860]  Remember, one of those five things that we needed to keep track
[31:46.860 --> 31:52.440]  was all of the word counts in the ham emails,
[31:52.440 --> 31:54.940]  all the word counts in the spam emails.
[31:55.200 --> 31:58.120]  But what this is going to do is count the number of emails
[31:58.120 --> 32:01.760]  that we have in ham, count the number of emails that we have in spam,
[32:01.760 --> 32:04.120]  and then count the total number of emails
[32:04.120 --> 32:07.600]  that we have in our training set.
[32:07.980 --> 32:12.180]  So for here, our testing and training split is already defined.
[32:12.180 --> 32:16.660]  It's 80-20, and so we'll see that in a moment.
[32:17.440 --> 32:20.020]  And we go ahead and hop into our second task,
[32:20.020 --> 32:22.200]  which is to load the training data.
[32:22.600 --> 32:25.900]  So task two, it asks us to create two lists,
[32:25.900 --> 32:27.260]  and we're going to have corpus.
[32:28.040 --> 32:31.160]  And it's important to get these words identical
[32:31.160 --> 32:33.240]  because of how we use them later on.
[32:33.240 --> 32:35.880]  So we're going to have a list called corpus.
[32:35.880 --> 32:38.700]  We're going to have a list called labels.
[32:39.300 --> 32:42.420]  And then the next bit is we want to load each of the email bodies
[32:42.420 --> 32:47.180]  from the data directory slash ham and data directory slash spam
[32:47.180 --> 32:49.680]  into the corpus array.
[32:49.840 --> 32:56.320]  So what we're going to do is for each in os.listdir,
[32:56.320 --> 33:02.280]  data.dir, this was defined above, slash ham.
[33:03.860 --> 33:09.660]  And then we're going to say with each open file,
[33:15.290 --> 33:18.900]  ham slash plus each,
[33:19.310 --> 33:20.970]  because that's what we defined in our for loop.
[33:20.970 --> 33:26.590]  We're going to read them as a file named f,
[33:26.590 --> 33:30.550]  corpus.append, f.read,
[33:30.550 --> 33:36.850]  and then labels.append the word ham.
[33:37.450 --> 33:39.570]  And the reason why is because this third task
[33:39.570 --> 33:42.950]  is load the labels for each email into the labels array.
[33:42.950 --> 33:44.910]  So in this first code block, we're going to read
[33:44.910 --> 33:47.710]  every single ham email in the ham directory.
[33:47.710 --> 33:50.770]  We're going to add the bodies to the corpus array,
[33:50.770 --> 33:56.050]  and then we're going to add the label ham into the labels array.
[33:56.050 --> 34:00.790]  Then we can go ahead and do the exact same thing for spam.
[34:00.790 --> 34:05.730]  We're just going to have to change each thing to spam,
[34:05.730 --> 34:09.210]  spam, and the labels to spam.
[34:09.630 --> 34:12.390]  And this will go through and read all of the emails.
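The loading step dictated above can be sketched like this. A temporary directory with a few made-up files stands in for the workshop's data directory (an assumption so the sketch is self-contained), and a single loop over both labels replaces the copy-pasted ham and spam blocks.

```python
import os
import tempfile

# Assumption: a throwaway directory stands in for the workshop's data
# directory, which has ham/ and spam/ subfolders of plain-text emails.
data_dir = tempfile.mkdtemp()
samples = {"ham": ["see you at lunch"], "spam": ["buy now", "free pills"]}
for label, bodies in samples.items():
    os.makedirs(os.path.join(data_dir, label))
    for i, body in enumerate(bodies):
        with open(os.path.join(data_dir, label, f"{i}.txt"), "w") as f:
            f.write(body)

corpus = []
labels = []

# One loop handles both classes instead of repeating the ham block for spam.
for label in ("ham", "spam"):
    folder = os.path.join(data_dir, label)
    for name in sorted(os.listdir(folder)):
        with open(os.path.join(folder, name), "r") as f:
            corpus.append(f.read())
        labels.append(label)

print(len(corpus), labels)
```

The two lists stay aligned by position: corpus[i] is the body and labels[i] is its class, which is why the names need to match what later cells expect.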
[34:12.490 --> 34:15.090]  So I'm going to leave this up for a moment,
[34:15.090 --> 34:17.290]  let you guys copy that down.
[34:28.440 --> 34:32.000]  And then I'm going to press shift enter and run it.
[34:32.000 --> 34:35.340]  So this was really quick, just read all the ham emails,
[34:35.340 --> 34:39.080]  all the spam emails, and now it's loaded into our arrays.
[34:39.820 --> 34:44.300]  If we want to get an idea of what our class count looks like,
[34:44.300 --> 34:48.860]  we have a total of 2,359 ham emails,
[34:48.860 --> 34:53.880]  1,100 spam emails, and then just as a graph we can see here.
[34:54.340 --> 34:57.320]  And I think it's important to graph these counts
[34:57.320 --> 34:59.800]  because sometimes you will have what are considered
[34:59.800 --> 35:03.160]  unbalanced data sets, and so you need to
[35:03.160 --> 35:05.720]  perform other changes here and there
[35:05.720 --> 35:08.240]  just to make sure that the data sets are balanced.
[35:08.420 --> 35:12.400]  Especially for things like credit card fraud,
[35:12.400 --> 35:15.820]  where that happens less than a percent of the time.
[35:15.820 --> 35:18.280]  You can build a model that's 99% accurate
[35:18.280 --> 35:21.080]  by just saying everything is a legitimate transaction,
[35:21.080 --> 35:23.100]  but unfortunately that doesn't really help
[35:23.100 --> 35:25.800]  when you're trying to stop credit card fraud.
[35:26.080 --> 35:27.920]  So when you have imbalanced data sets,
[35:27.920 --> 35:29.660]  there's additional things that you do.
[35:29.660 --> 35:31.780]  In this case, it looks a little unbalanced,
[35:31.780 --> 35:35.900]  but we'll be able to work with it no problems.
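A small arithmetic sketch of why accuracy is misleading on unbalanced data. The fraud numbers are hypothetical; the ham and spam counts are the ones quoted for the training set above.

```python
# Hypothetical numbers: 10,000 transactions, 0.5% of them fraudulent.
total = 10_000
fraud = 50
legit = total - fraud

# A "model" that labels every transaction as legitimate.
accuracy = legit / total
fraud_caught = 0 / fraud  # share of fraud cases it actually flags

print(f"accuracy: {accuracy:.1%}, fraud caught: {fraud_caught:.0%}")

# The same baseline on the training counts quoted in the workshop.
ham, spam = 2359, 1100
baseline = ham / (ham + spam)
print(f"always-ham baseline accuracy: {baseline:.1%}")
```

On the fraud example the do-nothing model looks excellent while catching nothing; on the email data the same baseline is far less flattering, which is part of why this data set is workable as it is.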
[35:38.360 --> 35:41.120]  So we are on task 2A.
[35:41.460 --> 35:43.700]  In this one we want to see the full text body
[35:43.700 --> 35:45.620]  of a random email in our corpus array,
[35:45.620 --> 35:48.400]  and then we want to see the tokenized result of that same email.
[35:48.400 --> 35:52.300]  So this is going to be very similar to what we did for task 1,
[35:52.300 --> 35:53.760]  which was printing out the test email,
[35:53.760 --> 35:56.380]  and then printing out the tokenized version of the test email.
[35:56.380 --> 35:58.720]  So we'll just go ahead and grab a random email.
[35:58.720 --> 36:02.140]  So email equals corpus.
[36:03.240 --> 36:04.980]  Actually, let me go ahead and do this.
[36:04.980 --> 36:09.500]  Let's have email ID equals just the number 5.
[36:09.500 --> 36:10.280]  Why not?
[36:10.920 --> 36:15.060]  And then we're going to grab email ID.
[36:15.760 --> 36:17.820]  So this is going to be our random email.
[36:18.480 --> 36:20.340]  We're going to...
[36:20.340 --> 36:23.280]  probably going to print a...
[36:23.280 --> 36:25.380]  No, I already have a divider in there. Cool.
[36:25.380 --> 36:27.100]  So let's just print that email.
[36:28.180 --> 36:33.660]  And then, same thing, we're going to do print tokenizer,
[36:34.360 --> 36:36.500]  and then that same email.
[36:36.680 --> 36:41.360]  So this line and this line is exactly the same that we did in task 1.
[36:41.360 --> 36:44.260]  This is the only thing that changes so that we grab a random email.
[36:44.260 --> 36:46.980]  And then you don't have to put 5, you can put any number,
[36:46.980 --> 36:49.000]  but I'm just picking 5.
[36:49.280 --> 36:51.520]  So go ahead, Shift-Enter.
[36:52.440 --> 36:54.800]  And this one's pretty nice.
[36:54.800 --> 37:01.620]  So we see the body of the email, product review, virttools, dev, URL,
[37:01.620 --> 37:06.380]  product review, virttool, dev, 2.0 URL.
[37:06.600 --> 37:12.420]  We see the full path for the URL here.
[37:12.940 --> 37:15.600]  Let's see if there's anything interesting.
[37:15.780 --> 37:19.340]  Maybe features, features is shortened to featur,
[37:20.000 --> 37:23.780]  because you can say featured in a motion picture with an ed at the end,
[37:23.780 --> 37:30.100]  features as they have here, or we will feature with just an e at the end.
[37:31.320 --> 37:34.060]  So this is the way that our tokenizer operates,
[37:34.060 --> 37:37.460]  but we still need to turn these words into numbers,
[37:37.460 --> 37:41.440]  and this is where our vectorizers come in.
[37:41.980 --> 37:44.440]  So our vectorizers, we're going to use two kinds of vectorizer.
[37:44.440 --> 37:47.600]  We're going to use the count vectorizer, which just takes the straight word count.
[37:47.600 --> 37:49.760]  It's exactly what we did in the slides.
[37:49.760 --> 37:56.460]  Then we're going to use a TF-IDF vectorizer that applies a mathematical weight to each of the words.
[37:56.680 --> 38:00.820]  And they work slightly different, but let's go ahead and define them,
[38:00.820 --> 38:04.440]  and we'll show you what the difference looks like.
[38:04.440 --> 38:08.140]  So we're going to name this CVEC as our count vectorizer.
[38:08.460 --> 38:12.580]  CountVectorizer, and the capitalization is important.
[38:12.680 --> 38:16.500]  Then we're going to use our tokenizer function above as our tokenizer function here.
[38:16.500 --> 38:20.840]  So we're going to have tokenizer equals tokenizer.
[38:21.620 --> 38:22.660]  Awesome.
[38:23.180 --> 38:30.980]  And then we want to vectorize our words, so this is going to be saved as countX.
[38:30.980 --> 38:36.500]  So we're going to have count underscore capital X equals...
[38:37.020 --> 38:38.920]  Oops, did not mean to open that.
[38:42.860 --> 38:46.660]  CVEC, our vectorizer, and then we're going to use fit transform.
[38:47.900 --> 38:51.320]  Fit is Scikit-learn's way of saying train.
[38:51.360 --> 38:56.000]  So in this instance, we're going to train our vectorizer.
[38:56.320 --> 39:02.580]  So fit transform, and then our corpus, which is just our list of emails.
[39:02.980 --> 39:06.640]  And then we're going to do the same thing for our TF-IDF vectorizer.
[39:06.640 --> 39:10.920]  So we're going to take TVEC equals TfidfVectorizer.
[39:11.460 --> 39:14.420]  And capitalization is important.
[39:14.420 --> 39:17.720]  Tokenizer equals tokenizer.
[39:17.720 --> 39:31.280]  And then our TF-IDF X capital X equals TVEC dot fit transform corpus.
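The two vectorizer definitions just dictated could look roughly like the sketch below. The three-email corpus is made up, and the default tokenization is used here; the workshop passes its own function through the tokenizer= parameter, which is left out so the sketch stays self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical three-email corpus standing in for the workshop data.
corpus = ["buy cheap pills", "cheap cheap flights", "lunch at noon"]

# Assumption: default tokenization; the workshop supplies tokenizer=tokenizer.
cvec = CountVectorizer()
count_X = cvec.fit_transform(corpus)   # learn the vocabulary, return raw counts

tvec = TfidfVectorizer()
tfidf_X = tvec.fit_transform(corpus)   # same vocabulary, weighted values

print(count_X.shape, tfidf_X.shape)
print(count_X.toarray())
```

Each row is one email and each column is one word from the learned vocabulary, so both matrices have the same shape and differ only in the values they hold.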
[39:32.180 --> 39:33.460]  Cool.
[39:33.460 --> 39:37.520]  So I'm going to let that sit there for a second while you guys are typing in.
[39:43.580 --> 39:46.000]  And then I'm going to go ahead and run it.
[39:46.600 --> 39:56.380]  So this takes about a minute because what it's doing is it is counting all of the words across all of the emails, the ham emails and the spam emails.
[39:56.660 --> 40:06.460]  And so as it's counting each of these words, it's applying a word count for the count vectorizer, and then it applies a weight for the TF-IDF vectorizer.
[40:06.460 --> 40:08.900]  And the weight takes all the words into account.
[40:08.900 --> 40:10.620]  So it becomes a whole process.
[40:10.620 --> 40:15.720]  But this takes about 60 to 120 seconds.
[40:15.960 --> 40:18.520]  So we'll go ahead and let that run.
[40:18.600 --> 40:27.320]  While that's running, we're just going to get started on task 3A, which is just to print each token from the test email.
[40:27.840 --> 40:29.460]  Now this one can be a little tricky.
[40:29.460 --> 40:34.700]  So there's a bit of a trick to identify the unique words and the counts.
[40:34.700 --> 40:39.800]  So what we're going to do is create a list of dictionary keys.
[40:39.800 --> 40:47.100]  So what that's going to look like is for i in list of dict.fromkeys.
[40:48.480 --> 40:55.440]  Tokenized email was defined above, and what that is is the tokenized version of our test email.
[40:55.820 --> 40:57.900]  So we're going to go ahead and print.
[40:58.480 --> 41:01.280]  I'm just using the Python format.
[41:04.220 --> 41:12.520]  Tokenized email dot count of each word, and then the word in the tokenized email.
[41:12.880 --> 41:16.220]  So I'm going to verify vectorizing is complete.
[41:17.360 --> 41:22.100]  And I'm going to let you guys run your vectorizers, let that sit for a little bit.
[41:22.100 --> 41:24.280]  And then we're going to run this, because we need...
[41:25.680 --> 41:28.720]  Oh, I don't believe we do need the previous one to run this.
[41:28.720 --> 41:34.840]  So if your vectorizer is still going, your code here will not run right away,
[41:34.840 --> 41:38.200]  but it will run as soon as the vectorizer completes.
[41:38.440 --> 41:40.540]  So I'm going to go ahead and run mine.
[41:40.560 --> 41:45.900]  So shift-enter, and tokenizer email is not correct.
[41:45.900 --> 41:50.400]  I think it's tokenized email, that makes a little bit more sense.
[41:50.580 --> 41:54.140]  I'm going to run that. Tokenized email is not defined.
[41:55.140 --> 42:01.220]  Okay, give me a second. I'm going to scroll up to our tokenizer.
[42:07.040 --> 42:09.680]  I guess I don't have that defined.
[42:12.820 --> 42:14.940]  Where is that?
[42:22.370 --> 42:25.550]  Ah, I was supposed to save it up above. That's okay.
[42:25.690 --> 42:28.590]  So we'll go ahead and create it here.
[42:29.750 --> 42:36.370]  Tokenized email equals tokenizer test email.
[42:36.370 --> 42:38.450]  So we're going to go ahead and save that.
[42:40.790 --> 42:43.910]  And here we have the word counts of our test email.
[42:43.910 --> 42:46.550]  We see the familiar East Asian font, Lenny.
[42:46.550 --> 42:53.110]  We know unsubscribe showed up twice, and we know lists.debian.org showed up twice.
[42:53.110 --> 42:56.690]  Everything else in here only showed up once.
[42:56.690 --> 43:02.790]  I know I went very quickly, but this was the line that I needed to add.
[43:02.790 --> 43:06.130]  Tokenized email equals tokenizer test email.
[43:07.130 --> 43:15.110]  And then we grab the list of the dictionary from that list of the tokenized email, and we print out the word counts.
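The unique-token trick walked through above can be sketched as follows. The token list is a shortened, hypothetical stand-in for the tokenized test email; collections.Counter is shown as an alternative that was not used in the session.

```python
from collections import Counter

# Hypothetical token list standing in for the tokenized test email.
tokenized_email = ["east", "asian", "font", "unsubscribe",
                   "lists.debian.org", "unsubscribe", "lists.debian.org"]

# dict.fromkeys keeps the first occurrence of each token, in order.
for token in list(dict.fromkeys(tokenized_email)):
    print("{}: {}".format(tokenized_email.count(token), token))

# Counter gives the same counts in a single pass over the list.
counts = Counter(tokenized_email)
```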
[43:18.390 --> 43:26.130]  So going on to 3b, what we want to see is, okay, we know what the word count looks like if we just did a straight word count.
[43:26.130 --> 43:34.130]  How do the count vectorizer and the TF-IDF vectorizer transform our test email as we're going through?
[43:34.610 --> 43:41.590]  So we're going to go ahead and use our example CVEC.
[43:41.590 --> 43:48.110]  And the reason why we're using a separate example version is because we want to redefine a count vectorizer.
[43:50.330 --> 43:54.310]  Don't forget our tokenizer equals tokenizer.
[43:55.330 --> 44:08.290]  And then we're going to have our exampleX using our exampleCVEC.fit_transform.
[44:08.450 --> 44:11.510]  And then it expects a list.
[44:12.110 --> 44:20.330]  Our test email is a simple string, so we're going to have to wrap it in these brackets so that it becomes a list.
[44:20.330 --> 44:22.970]  And then we're going to have our test email.
[44:24.310 --> 44:25.410]  Cool.
[44:25.690 --> 44:31.560]  And then we're going to go ahead and print our example email.
[44:32.470 --> 44:33.510]  Oops.
[44:33.770 --> 44:35.370]  ExampleX.
[44:36.050 --> 44:40.170]  And then here, we're going to do the same thing.
[44:40.170 --> 44:47.270]  We're going to have exampleTVEC equals TfidfVectorizer.
[44:47.270 --> 44:49.310]  Capitals are important.
[44:49.350 --> 44:51.430]  Tokenizer equals tokenizer.
[44:51.430 --> 44:57.710]  And then our example, TF-IDF.
[44:57.710 --> 45:00.110]  I'll just call it exampleX because it doesn't matter.
[45:00.110 --> 45:03.790]  Once we print out our original exampleX, we don't need it anymore.
[45:07.830 --> 45:09.890]  ExampleTVEC.fit_transform.
[45:13.630 --> 45:17.890]  And then the brackets, test email.
[45:18.470 --> 45:22.750]  And then we can print our exampleX.
[45:23.670 --> 45:26.030]  So I'm going to let that sit for a moment.
[45:26.130 --> 45:28.230]  Let you guys copy that down.
[45:32.690 --> 45:35.210]  I'm going to just look ahead real quick.
[45:35.310 --> 45:36.990]  We are doing great.
[45:40.670 --> 45:41.370]  Okay.
[45:41.370 --> 45:43.310]  So I'm going to go ahead and run this now.
[45:43.310 --> 45:44.870]  I pressed Shift-Enter.
[45:45.910 --> 45:53.090]  And if we look at our count vectorizer of test email, we see these different numbers.
[45:53.090 --> 45:56.150]  And what these numbers represent are tokens.
[45:56.150 --> 45:59.390]  So, for example, a very nice game.
[45:59.390 --> 46:01.170]  A would be token 1.
[46:01.170 --> 46:02.750]  Very would be token 2.
[46:02.750 --> 46:04.130]  Nice would be token 3.
[46:04.130 --> 46:05.970]  Game would be token 4.
[46:06.530 --> 46:11.850]  So in this case, Scikit-Learn actually knows the best way that it wants to tokenize things.
[46:11.850 --> 46:17.810]  So the numbers, for example, 0, this does not necessarily represent East.
[46:18.930 --> 46:23.310]  And so, basically, Scikit-Learn knows how these tokens are used.
[46:23.310 --> 46:24.950]  We don't know how these tokens are used.
[46:24.950 --> 46:32.490]  But something that we can point out is we see token 16 shows up twice and token 9 shows up twice.
[46:32.490 --> 46:41.850]  So based on what we saw up above, we can guarantee that either token 16 or token 9 is unsubscribe.
[46:41.850 --> 46:45.950]  And then the other one is lists.debian.org.
[46:46.350 --> 46:47.510]  And so these show up twice.
[46:47.510 --> 46:50.730]  So this is exactly the same as if we had done the word count ourselves.
[46:51.390 --> 46:58.830]  The TF-IDF vectorizer, however, uses a bit of a weight algorithm.
[46:58.830 --> 47:03.270]  And so everything that was set as 1 has this 0.204 weight.
[47:03.450 --> 47:10.410]  The 2, which is token 9 and token 16, have this 0.408 weight.
[47:10.650 --> 47:14.810]  And this doesn't necessarily happen in every instance.
[47:14.810 --> 47:22.090]  But in this instance, you can see that 0.408 is exactly twice 0.204.
[47:22.710 --> 47:28.030]  And so this is the way that count vectorizer operates and TF-IDF vectorizer operates.
[47:29.270 --> 47:33.810]  The underlying, what happens underneath the hood, we don't really need to worry about.
[47:33.810 --> 47:40.070]  We just need to know how these work and the differences between the two.
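For anyone curious about what is under the hood here, a small sketch on a made-up one-line email. Two things it illustrates: the vectorizer's vocabulary_ attribute shows which token index belongs to which word (indices follow alphabetical order), and with a single document every word gets the same IDF, so the TF-IDF weights are simply the counts scaled to unit length. That would be consistent with the 0.204 and 0.408 seen in the session, since 1/sqrt(24) is about 0.204, though the workshop output is not reproduced here.

```python
import math

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical one-line "email": two words appear twice, two appear once.
email = "unsubscribe lists unsubscribe lists east font"

example_cvec = CountVectorizer()
counts = example_cvec.fit_transform([email])   # note the list wrapper

# vocabulary_ maps each token to its column index (alphabetical order).
print(example_cvec.vocabulary_)
print(counts.toarray()[0])

example_tvec = TfidfVectorizer()
weights = example_tvec.fit_transform([email]).toarray()[0]
print(weights.round(3))

# With one document the IDF is the same for every word, so the weights are
# the counts divided by their Euclidean length (default L2 normalization).
norm = math.sqrt(sum(int(c) ** 2 for c in counts.toarray()[0]))
```

Once more than one email is fitted, rare words get a higher IDF than common ones, so the neat "twice the count, twice the weight" pattern generally no longer holds.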
[47:42.950 --> 47:46.030]  So moving on, we're going to load the testing data.
[47:46.030 --> 47:47.390]  This is already written for us.
[47:47.390 --> 47:54.130]  So we're just going to grab the first 10 files in the testing directory and print those out.
[47:54.130 --> 47:57.930]  So we see we just have all these test emails.
[47:57.970 --> 48:02.870]  And at the end of the file name, we have the label that it's supposed to belong to.
[48:02.870 --> 48:06.110]  So we have ham, ham, spam, spam, spam, ham.
[48:08.830 --> 48:12.130]  And so what we're going to do is read each of these emails.
[48:12.130 --> 48:17.890]  And then we're going to pull off the last little bit of the file name and use that as our label.
[48:17.950 --> 48:22.950]  And we want to have the testing separate from the training so that we can evaluate how well our model is doing.
[48:23.450 --> 48:26.310]  So we're going to load the testing data.
[48:26.310 --> 48:30.630]  And this one's a little complicated, so we have a couple tasks to run through.
[48:30.630 --> 48:32.970]  So we're going to create two list arrays, kind of like we did earlier.
[48:32.970 --> 48:37.150]  We're going to have test corpus, and we're going to set that as an empty list.
[48:37.150 --> 48:42.610]  And we're going to have test labels, which is also going to be an empty list.
[48:43.390 --> 48:48.650]  And then we are going to load all of the emails from the test directory.
[48:48.650 --> 49:05.650]  So we're going to say, for filename in os.listdir, our data directory, plus test, which is where it's located, with each file that is open.
[49:05.650 --> 49:14.830]  And we need to tell it which file, test plus filename.
[49:15.470 --> 49:26.050]  For read, as a file named f, we are going to do test corpus.append, f.read.
[49:26.690 --> 49:32.610]  And then we are going to do something a little weird here, so bear with me.
[49:32.610 --> 49:37.110]  I'm just going to use re.split, just to split this up.
[49:37.310 --> 49:39.690]  I'm looking for txt.
[49:41.750 --> 49:46.250]  And then we're just going to grab the last bit in that split.
[49:46.250 --> 49:51.990]  So this is going to split the file trainemail.txt.
[49:52.450 --> 49:57.210]  So it's going to split it to this section, and then to this section.
[49:58.050 --> 50:05.830]  So this would be considered index 0, and this last little bit would be considered index 1, which is why we have this 1 here at the end.
[50:07.570 --> 50:09.290]  And then we're going to append the labels.
[50:09.290 --> 50:14.190]  So test labels.append, label.
[50:16.430 --> 50:21.290]  And so I'm going to give you guys a little bit to go ahead and get this all written up.
[50:21.410 --> 50:27.090]  Be careful, there is a backslash before this period, because this is a regular expression.
[50:29.650 --> 50:31.590]  And everything should be good.
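A sketch of the label-splitting step. The file name format and the exact pattern are assumptions, since the transcript does not show them; the point is only that the period has to be escaped in a regular expression and that the label is the second piece of the split.

```python
import re

# Assumption: file names look like "<name>.txt.<label>"; the exact naming
# in the workshop's test directory is not shown in the transcript.
filename = "email_0001.txt.spam"

# The period is escaped because "." matches any character in a regex.
parts = re.split(r"\.txt\.", filename)
label = parts[1]          # index 0 is the name, index 1 is the label

print(parts, label)
```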
[50:32.390 --> 50:35.670]  So I'm just going to let that sit for a little bit, let you guys do your thing.
[50:35.670 --> 50:37.850]  In the meantime, I'm just going to look ahead real quick.
[50:39.950 --> 50:41.850]  The rest of this gets so much easier.
[50:42.030 --> 50:44.750]  It's pretty crazy how easy it gets.
[50:46.030 --> 50:48.890]  In the meantime, TAs, are there any questions in the chat?
[50:57.320 --> 50:59.620]  We appear to be good so far.
[51:00.060 --> 51:01.020]  Awesome.
[51:03.200 --> 51:06.260]  So I'm going to go ahead and move on.
[51:06.260 --> 51:07.720]  I'm going to press Shift-Enter.
[51:07.720 --> 51:10.440]  It's going to load all of those emails.
[51:10.440 --> 51:19.660]  So this is just reading all of the test emails into our corpus array and grabbing those labels and putting them into the end of our test labels array.
[51:20.160 --> 51:24.060]  And then just like last time, what this is going to do is print out the number of test emails.
[51:24.060 --> 51:27.780]  We have 590 ham emails, 276 spam emails.
[51:27.780 --> 51:31.200]  And here's just a nice pretty graph representing what that looks like.
[51:31.860 --> 51:37.900]  And as I said earlier, this information or this data has already been split up in an 80-20 split.
[51:37.900 --> 51:43.900]  So 80% of it is being used for our training and then 20% of it is being used for our testing.
[51:46.080 --> 51:51.460]  And then this one, it might sound complicated, but this is actually really easy.
[51:51.460 --> 51:55.760]  We have already trained our CVEC and our TVEC.
[51:55.760 --> 52:01.700]  And now all we need to do is collect our testing values.
[52:01.700 --> 52:06.660]  We need to tokenize our testing values using these already trained vectorizers.
[52:06.660 --> 52:08.800]  So this is super easy.
[52:10.800 --> 52:19.860]  Test count X equals CVEC dot transform, test corpus.
[52:20.140 --> 52:23.060]  And that's it. That's all we've got to do for that.
[52:23.480 --> 52:25.320]  We can just copy and paste here.
[52:25.580 --> 52:31.760]  Say our test TF-IDF X is our TVEC dot transform, test corpus.
[52:32.560 --> 52:38.740]  The reason this works is if you remember in the slides how we had the sentences that we already knew about,
[52:38.740 --> 52:43.120]  and then we had to perform a counting game for "a very close game."
[52:43.180 --> 52:46.680]  We had to count how many times "a" showed up in the sentences that we were already aware of,
[52:46.680 --> 52:50.820]  how many times "very" showed up, how many times "close" showed up, how many times "game" showed up.
[52:50.820 --> 52:54.260]  So this transform allows us to do just that.
[52:54.260 --> 52:57.740]  So what this will do is grab each email and it will say,
[52:57.740 --> 53:02.040]  okay, here's the tokenized words of the email that I have.
[53:02.040 --> 53:04.500]  How many times does it show up in the count vectorizer?
[53:04.500 --> 53:08.940]  And then what is the weight that it shows up in the TFIDF vectorizer?
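The fit-once, transform-later idea can be sketched with a toy corpus. The variable names mirror the ones spoken in the workshop but are assumptions here, and the sentences are made up. Words the vectorizers never saw during fitting are simply ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy training and test text (hypothetical, not the workshop data).
train_corpus = ["win a free prize now",
                "meeting notes for the team",
                "free prize inside"]
test_corpus = ["claim your free prize", "team meeting tomorrow"]

cvec = CountVectorizer()
tvec = TfidfVectorizer()
count_X = cvec.fit_transform(train_corpus)   # learns the vocabulary
tfidf_X = tvec.fit_transform(train_corpus)   # learns vocabulary and IDF weights

# transform() reuses what was learned; it does not learn new words.
test_count_X = cvec.transform(test_corpus)
test_tfidf_X = tvec.transform(test_corpus)

print(test_count_X.shape, test_tfidf_X.shape)
print("claim" in cvec.vocabulary_)
```

Because the test matrices share the training vocabulary, they have the same number of columns as the training matrices, which is what lets a trained model score them.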
[53:09.740 --> 53:12.640]  So I'm going to go ahead, Shift-Enter, run that.
[53:13.180 --> 53:15.260]  And just like last time, this is going to take a little bit.
[53:15.340 --> 53:18.900]  It's counting all the words and associating them with the different tokens.
[53:19.280 --> 53:21.980]  This is going to be significantly faster than training it.
[53:21.980 --> 53:23.680]  See, it's almost done already.
[53:25.020 --> 53:26.720]  And so we're going to go ahead and move on.
[53:28.580 --> 53:32.700]  So this bit, we're going to get ready to test and evaluate our models.
[53:32.700 --> 53:36.840]  So this is just a helper function that's going to show us a bunch of really neat statistics.
[53:37.100 --> 53:40.940]  The details of this helper function you can look into on your own.
[53:41.180 --> 53:47.000]  But for now, what we need to use as an input is our confusion matrix,
[53:47.000 --> 53:50.400]  our score, and our classification report.
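The notebook's helper is not shown in this transcript, so the following is only a hypothetical stand-in with the same three inputs. It assumes the confusion matrix has rows for actual classes and columns for predicted classes, in the order ham then spam.

```python
def generate_report(cmatrix, score, creport):
    """Print a classification report, a labeled 2x2 confusion matrix, and accuracy.

    Assumes cmatrix is [[ham->ham, ham->spam], [spam->ham, spam->spam]],
    i.e. rows are actual classes and columns are predicted classes.
    """
    lines = [creport, ""]
    lines.append(f"{'':<12}{'pred ham':>10}{'pred spam':>10}")
    for name, row in zip(("actual ham", "actual spam"), cmatrix):
        lines.append(f"{name:<12}{row[0]:>10}{row[1]:>10}")
    lines.append(f"Accuracy: {score:.1%}")
    text = "\n".join(lines)
    print(text)
    return text


# Made-up numbers purely to exercise the function.
out = generate_report([[5, 1], [2, 4]], 0.75, "(classification report text)")
```

The real helper most likely draws a chart rather than printing text; the point here is only how the three inputs fit together.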
[53:51.660 --> 53:54.600]  So we're just going to go ahead, Shift-Enter, run that.
[53:54.960 --> 53:57.780]  Definition defined.
[53:59.220 --> 54:04.740]  And then here's where we get into tasks 6a, 6b, 6c, and 6d,
[54:04.740 --> 54:09.300]  which is going to be us training and evaluating our different models.
[54:09.380 --> 54:13.380]  So we're going to start with multinomial Naive Bayesian with a TF-IDF vectorizer,
[54:13.380 --> 54:18.140]  with a count vectorizer, and then we're going to do the same for logistic regression with TF-IDF
[54:18.140 --> 54:22.040]  and logistic regression with a count vectorizer.
[54:24.020 --> 54:26.560]  So we're going to start out just one thing at a time.
[54:27.320 --> 54:30.420]  We're going to define our constructor.
[54:30.420 --> 54:36.480]  So this is going to be MNB TF-IDF equals multinomial NB.
[54:36.480 --> 54:38.660]  Watch your capitalization.
[54:39.580 --> 54:42.360]  We're going to define that, and then we're going to train it.
[54:43.740 --> 54:52.220]  MNB TF-IDF dot fit, which is train in scikit-learn world.
[54:52.220 --> 54:57.140]  And we're going to use our TF-IDF X, which is our preprocessed information,
[54:57.140 --> 55:00.080]  and then we're going to compare it to our labels.
[55:02.060 --> 55:03.860]  So that's going to train our model.
[55:03.860 --> 55:07.300]  Now we're going to grab a couple really neat statistics.
[55:07.300 --> 55:12.020]  So the first one we're going to get is our score, and this wants us to call it score.
[55:14.380 --> 55:18.880]  Score MNB TF-IDF predictions using our predict function.
[55:18.880 --> 55:23.100]  We're going to develop a confusion matrix called C matrix, MNB TF-IDF,
[55:23.100 --> 55:27.440]  and then classification report, C report, MNB TF-IDF.
[55:28.540 --> 55:30.280]  So we're going to start with our score.
[55:30.280 --> 55:37.380]  Score MNB TF-IDF equals MNB TF-IDF dot score.
[55:37.980 --> 55:41.660]  And then we're going to use our test values, because we've trained it on our training data,
[55:41.660 --> 55:43.280]  and we want to test it with our testing.
[55:43.280 --> 55:47.420]  So we're going to do test TF-IDF X,
[55:47.420 --> 55:54.060]  because that's the way that our system was vectorizing these.
[55:54.500 --> 55:57.520]  So we don't want to use count vectorizer with a TF-IDF train,
[55:57.520 --> 55:58.840]  because that's not going to work.
[55:59.140 --> 56:01.540]  And then we're going to use test labels.
[56:03.980 --> 56:05.200]  That's going to give us our score.
[56:05.200 --> 56:06.860]  So we're going to get our predictions.
[56:07.180 --> 56:14.760]  Predictions MNB TF-IDF equals MNB TF-IDF dot predict.
[56:14.900 --> 56:18.600]  And this is going to give us our raw predictions.
[56:21.360 --> 56:26.080]  This is going to tell us what it thinks each email is,
[56:26.080 --> 56:29.720]  whereas our score is going to see how accurate it is.
[56:29.720 --> 56:33.880]  And then this is going to be C matrix.
[56:34.880 --> 56:42.680]  MNB TF-IDF equals confusion matrix.
[56:42.980 --> 56:48.620]  We're going to use our test labels and our predictions.
[56:48.780 --> 56:51.280]  MNB TF-IDF.
[56:51.280 --> 56:56.800]  So this is going to tell us what it thinks the value is,
[56:56.800 --> 56:58.400]  if it's a spam or ham email,
[56:58.400 --> 57:01.300]  and then what it actually is.
[57:01.500 --> 57:03.940]  So it's going to give us our true positives, our false positives,
[57:03.940 --> 57:05.740]  our true negatives, and our false negatives.
[57:06.080 --> 57:07.700]  And then the C report.
[57:07.700 --> 57:10.040]  We're not going to do too much with this,
[57:10.040 --> 57:14.160]  but what a classification report will do is show us some more metrics,
[57:14.160 --> 57:16.620]  such as the recall and the precision,
[57:16.620 --> 57:18.080]  which we don't need to worry about too much.
[57:18.080 --> 57:20.340]  We're just going to focus on the accuracy for now.
[57:20.340 --> 57:23.840]  But for data science, it's important to know what the recall
[57:23.840 --> 57:28.740]  and the precision are as we go through things.
[57:30.340 --> 57:34.800]  Report, test labels, and then we're going to use our predictions again.
[57:36.180 --> 57:43.440]  And this is just like our C matrix MNB TF-IDF,
[57:44.100 --> 57:45.740]  in the order that we put things.
[57:45.740 --> 57:48.020]  We have test labels, test labels, predictions, predictions.
[57:48.820 --> 57:49.920]  And that is it.
[57:49.920 --> 57:52.240]  And then we're going to use our helper function generate report.
[57:53.560 --> 57:55.680]  So I know that's a lot of typing.
[57:55.720 --> 58:00.420]  I'm going to go ahead and sit here as you guys kind of follow through.
[58:00.420 --> 58:03.880]  So we have our MNB TF-IDF constructor.
[58:03.880 --> 58:07.120]  We are going to train it using the fit function.
[58:07.120 --> 58:09.440]  We're going to train it on our training data,
[58:09.440 --> 58:11.140]  which is stored as TF-IDF X,
[58:11.140 --> 58:13.980]  and our training labels, which is just stored as labels.
[58:13.980 --> 58:19.960]  We are going to use score, MNB TF-IDF.score,
[58:19.960 --> 58:22.560]  to evaluate our accuracy.
[58:22.880 --> 58:25.740]  We're going to use our predictions to get the raw predictions.
[58:25.740 --> 58:27.260]  And then with the raw predictions,
[58:27.260 --> 58:32.260]  we are going to develop a confusion matrix and a classification report,
[58:32.260 --> 58:35.620]  which will tell us how well our model is doing visually.
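Put together, the steps just recapped might look like the sketch below. The toy corpus and the snake_case variable names are assumptions standing in for the notebook's own data and names; the scikit-learn calls (`fit`, `score`, `predict`, `confusion_matrix`, `classification_report`) are the ones named in the walkthrough.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the workshop's corpus and labels (hypothetical).
corpus = ["free prize click now", "win free money now",
          "team meeting at noon", "lunch with the team"]
labels = ["spam", "spam", "ham", "ham"]
test_corpus = ["free money prize", "meeting with the team"]
test_labels = ["spam", "ham"]

tvec = TfidfVectorizer()
tfidf_X = tvec.fit_transform(corpus)         # fit on training data only
test_tfidf_X = tvec.transform(test_corpus)   # reuse the fitted vocabulary

mnb_tfidf = MultinomialNB()
mnb_tfidf.fit(tfidf_X, labels)               # train on training features and labels

score_mnb_tfidf = mnb_tfidf.score(test_tfidf_X, test_labels)   # accuracy
predictions_mnb_tfidf = mnb_tfidf.predict(test_tfidf_X)        # raw predictions
cmatrix_mnb_tfidf = confusion_matrix(test_labels, predictions_mnb_tfidf)
creport_mnb_tfidf = classification_report(test_labels, predictions_mnb_tfidf)

print(score_mnb_tfidf)
print(cmatrix_mnb_tfidf)
print(creport_mnb_tfidf)
```

Note the argument order in the two metric functions: true labels first, predictions second.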
[58:38.920 --> 58:41.100]  So I'm going to let that sit for one more minute.
[58:41.100 --> 58:43.220]  TA, are there any questions?
[58:52.160 --> 58:56.320]  No, I think... well, hang on, let me go back one here.
[58:57.940 --> 59:00.300]  Nope, still got it. So I think we're good.
[59:00.340 --> 59:02.460]  Awesome. Awesome. Fantastic.
[59:03.500 --> 59:08.100]  So I am going to go ahead and move on.
[59:08.100 --> 59:10.400]  I'm going to press Shift-Enter. It's going to run.
[59:11.260 --> 59:16.800]  And now we have our... up here at the top is our classification report.
[59:17.280 --> 59:19.780]  And so what this is going to do is show us our precision
[59:19.780 --> 59:23.460]  and our recall. We don't need to worry too much about this one.
[59:23.620 --> 59:26.140]  Instead, I want you to focus on the lower chart.
[59:26.140 --> 59:28.680]  So here we have an accuracy of 87.5%.
[59:29.420 --> 59:35.840]  We can see that all of the actual ham emails were predicted as ham.
[59:35.840 --> 59:37.800]  So these are our true positives.
[59:38.520 --> 59:43.300]  A little more than half of our spam emails were correctly predicted as spam.
[59:43.300 --> 59:45.760]  So these are our true negatives.
[59:45.760 --> 59:51.020]  And then, unfortunately, it misclassified...
[59:51.760 --> 59:53.760]  Oh, my mistake. I have that backwards.
[59:53.760 --> 01:00:00.320]  So it says 108 of our ham emails it thinks are spam emails,
[01:00:00.320 --> 01:00:04.860]  whereas all of the spam emails, so the actual spam emails,
[01:00:04.860 --> 01:00:08.580]  were correctly classified as spam.
[01:00:08.580 --> 01:00:17.880]  And then all of our ham emails, what it thought were ham emails, were ham.
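As a quick sanity check, the accuracy can be recomputed from the numbers read out above: 590 ham and 276 spam test emails, 108 ham emails predicted as spam, and no spam predicted as ham. The variable names are just for this sketch.

```python
ham_total, spam_total = 590, 276   # test-set sizes reported earlier
ham_as_spam = 108                  # ham emails the model predicted as spam
spam_as_ham = 0                    # every spam email was caught

correct = (ham_total - ham_as_spam) + (spam_total - spam_as_ham)
accuracy = correct / (ham_total + spam_total)
print(f"{accuracy:.1%}")   # 87.5%
```

So the 87.5% figure is entirely explained by the 108 blocked ham emails.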
[01:00:20.760 --> 01:00:26.440]  So we're going to go ahead and do the same thing for 6b, 6c, and 6d.
[01:00:27.460 --> 01:00:30.460]  One of the beautiful things, and I'm going to show you this
[01:00:30.460 --> 01:00:36.800]  just as a super quick and easy way of going about and doing things,
[01:00:36.800 --> 01:00:39.540]  is I'm just going to copy the code that I'd already written.
[01:00:39.540 --> 01:00:42.360]  This is why machine learning is awesome,
[01:00:42.360 --> 01:00:45.740]  and very different from programming normal algorithms,
[01:00:45.740 --> 01:00:49.240]  is all we need to do is just change these TF-IDFs to count,
[01:00:49.240 --> 01:00:51.740]  and then everything else is the same.
[01:00:52.880 --> 01:00:57.820]  So we use MNB count to define our multinomial Naive Bayesian,
[01:00:57.820 --> 01:01:02.140]  as has been asked of us here in 6b.
[01:01:03.080 --> 01:01:05.320]  MNB count dot fit, and we're going to use our count X
[01:01:05.320 --> 01:01:07.560]  because that's how it's been vectorized.
[01:01:08.120 --> 01:01:13.980]  We're going to change our score MNB count using MNB count dot score.
[01:01:13.980 --> 01:01:20.740]  We're going to change our test TF-IDF X to test count X,
[01:01:21.320 --> 01:01:25.580]  and then likewise kind of all throughout here.
[01:01:29.310 --> 01:01:31.310]  Test labels is going to stay the same,
[01:01:31.310 --> 01:01:35.010]  predictions is going to change because it is no longer predictions TF-IDF,
[01:01:35.010 --> 01:01:36.450]  it is predictions count,
[01:01:36.810 --> 01:01:41.310]  and then we have our confusion matrix and our classification report.
[01:01:41.670 --> 01:01:43.150]  So just to go over it one more time,
[01:01:43.150 --> 01:01:46.590]  we define our multinomial Naive Bayesian constructor,
[01:01:46.590 --> 01:01:50.670]  we train it using the fit function and the count vectorizer,
[01:01:50.670 --> 01:01:53.410]  our testing data, our training data.
[01:01:54.250 --> 01:02:02.470]  We develop our score off of our testCount data using our testLabels,
[01:02:02.470 --> 01:02:05.910]  and then we have our predictions,
[01:02:05.910 --> 01:02:08.650]  we have our confusion matrix,
[01:02:08.650 --> 01:02:10.350]  and our classification report.
[01:02:10.350 --> 01:02:12.910]  So I'm going to let this sit for just a little bit
[01:02:12.910 --> 01:02:16.910]  as you guys go through and copy that down.
[01:02:25.810 --> 01:02:30.290]  What's really cool is in the GitHub repository,
[01:02:30.290 --> 01:02:32.370]  there are other workbooks.
[01:02:32.370 --> 01:02:35.630]  So for example, we use the exact same methodology
[01:02:35.630 --> 01:02:41.010]  to build a malicious URL predictor,
[01:02:41.010 --> 01:02:45.510]  so things that look like it's possibly part of a botnet.
[01:02:47.510 --> 01:02:49.590]  All of this code is exactly the same.
[01:02:49.590 --> 01:02:51.230]  Once we vectorize our information,
[01:02:51.230 --> 01:02:53.230]  we just copy and paste this code in,
[01:02:53.230 --> 01:02:54.990]  and it does the exact same thing.
[01:02:55.250 --> 01:02:57.550]  And so that's the benefit of machine learning,
[01:02:57.550 --> 01:02:58.890]  is it learns from the data itself
[01:02:58.890 --> 01:03:01.110]  instead of us having to manually program,
[01:03:01.110 --> 01:03:02.670]  okay, this solves problem A,
[01:03:02.670 --> 01:03:05.610]  but now we have to rewrite this to solve problem B.
[01:03:05.610 --> 01:03:07.450]  It just all does the same thing.
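One way to make the "same code, different data" point concrete is to wrap the train-and-evaluate steps in a single function and loop over the combinations. This is a sketch, not the notebook's code: `train_and_evaluate`, the toy corpus, and the result keys are all assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB


def train_and_evaluate(model, train_X, train_y, test_X, test_y):
    """Fit the model, then return (accuracy, confusion matrix) on the test set."""
    model.fit(train_X, train_y)
    predictions = model.predict(test_X)
    return model.score(test_X, test_y), confusion_matrix(test_y, predictions)


# Toy data standing in for the real corpus (hypothetical).
corpus = ["free prize click now", "win free money now",
          "team meeting at noon", "lunch with the team"]
labels = ["spam", "spam", "ham", "ham"]
test_corpus = ["free money prize", "meeting with the team"]
test_labels = ["spam", "ham"]

vectorizers = {"count": CountVectorizer(), "tfidf": TfidfVectorizer()}
models = {"mnb": MultinomialNB,
          "lgs": lambda: LogisticRegression(solver="lbfgs")}

results = {}
for vec_name, vec in vectorizers.items():
    train_X = vec.fit_transform(corpus)
    test_X = vec.transform(test_corpus)
    for model_name, make_model in models.items():
        score, cmatrix = train_and_evaluate(
            make_model(), train_X, labels, test_X, test_labels)
        results[f"{model_name}_{vec_name}"] = score

print(results)
```

Swapping in a different problem, such as URLs instead of emails, would only change the corpus and labels; the function itself stays untouched.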
[01:03:09.290 --> 01:03:13.390]  So I am going to go ahead and move on.
[01:03:13.390 --> 01:03:15.390]  I'm going to press Shift-Enter.
[01:03:16.710 --> 01:03:18.710]  And we see that we got a little bit better, right?
[01:03:18.710 --> 01:03:21.090]  We are at a 94% accuracy.
[01:03:21.630 --> 01:03:27.030]  It thinks that some of our ham emails are still spam,
[01:03:27.030 --> 01:03:28.150]  but significantly less.
[01:03:28.150 --> 01:03:30.250]  We went from 108 down to 23,
[01:03:30.250 --> 01:03:35.610]  but now it thinks that some of our spam emails
[01:03:35.610 --> 01:03:37.750]  are actually ham.
[01:03:37.810 --> 01:03:38.670]  And so that's a problem.
[01:03:38.670 --> 01:03:41.450]  So we have a few more emails going through our filter
[01:03:41.450 --> 01:03:42.870]  than we would like,
[01:03:42.870 --> 01:03:47.590]  but we have significantly fewer of our ham emails
[01:03:47.590 --> 01:03:49.250]  from being blocked.
[01:03:50.590 --> 01:03:54.110]  GT, we've got a question from Twitch chat.
[01:03:54.110 --> 01:03:54.650]  Absolutely.
[01:03:54.650 --> 01:03:56.670]  So Chris...
[01:03:56.670 --> 01:03:59.270]  I'm not going to try and pronounce the name.
[01:03:59.270 --> 01:04:02.970]  Did you intentionally avoid having spam emails classified as ham
[01:04:02.970 --> 01:04:04.330]  when you created the model,
[01:04:04.330 --> 01:04:07.130]  or is it simply a trial-and-error kind of thing
[01:04:07.130 --> 01:04:08.610]  when training the model?
[01:04:09.550 --> 01:04:13.530]  Intentionally avoid having spam emails classified as ham?
[01:04:14.170 --> 01:04:16.430]  Is that what they said?
[01:04:16.430 --> 01:04:17.310]  Yeah.
[01:04:17.990 --> 01:04:22.050]  Yeah, we wouldn't want spam emails to be classified as ham.
[01:04:22.050 --> 01:04:24.390]  We want to draw a line and separate
[01:04:24.390 --> 01:04:26.930]  so that you stop getting emails about Viagra
[01:04:26.930 --> 01:04:30.990]  and you start getting the regular emails from your school or your work.
[01:04:31.130 --> 01:04:34.330]  And so this is just kind of the way that spam filters operate.
[01:04:34.870 --> 01:04:39.610]  So we could build a model that would classify more spam emails
[01:04:39.610 --> 01:04:43.610]  as being part of ham or regular emails,
[01:04:43.610 --> 01:04:47.610]  or we can build a model that says,
[01:04:47.610 --> 01:04:49.990]  okay, we want absolutely zero spam emails.
[01:04:49.990 --> 01:04:53.130]  If we wind up losing regular emails
[01:04:53.130 --> 01:04:55.830]  and classifying those as spam as a mistake,
[01:04:55.830 --> 01:04:57.390]  we're okay with that.
[01:04:57.770 --> 01:05:00.630]  A really good example, actually I'm glad you brought this up,
[01:05:00.630 --> 01:05:01.990]  is antivirus.
[01:05:02.170 --> 01:05:04.150]  So the way that antivirus companies work
[01:05:04.150 --> 01:05:07.190]  is they can err on the side of,
[01:05:07.670 --> 01:05:11.030]  oh, I think this is a virus, therefore it must be a virus,
[01:05:11.030 --> 01:05:13.350]  or this could be benign,
[01:05:13.350 --> 01:05:15.750]  or I kind of think this is a virus,
[01:05:15.750 --> 01:05:18.250]  but instead of blocking it, I'm just going to let it go through.
[01:05:18.870 --> 01:05:20.790]  And so you think about the way that malware works.
[01:05:23.110 --> 01:05:26.450]  You have malware that will make calls out to the internet,
[01:05:26.450 --> 01:05:28.190]  it'll probably take over your webcam,
[01:05:28.190 --> 01:05:29.650]  it'll try and write files,
[01:05:29.650 --> 01:05:31.490]  it'll probably take over your microphone.
[01:05:31.750 --> 01:05:33.150]  What else does that?
[01:05:33.470 --> 01:05:37.910]  Chrome, IE, Firefox, browsers.
[01:05:38.070 --> 01:05:40.350]  And so the idea is instead of blocking
[01:05:40.350 --> 01:05:45.170]  what could be legitimate applications,
[01:05:45.170 --> 01:05:48.050]  and so that would really interrupt the user experience,
[01:05:48.050 --> 01:05:50.930]  they kind of err on the side of,
[01:05:50.930 --> 01:05:52.830]  we don't want to block the user experience
[01:05:52.830 --> 01:05:55.870]  because then you'll uninstall the antivirus.
[01:05:55.990 --> 01:06:01.310]  So they are more willing to let viruses go through
[01:06:01.310 --> 01:06:04.570]  than blocking legitimate applications.
[01:06:05.450 --> 01:06:06.630]  And so in this case,
[01:06:06.630 --> 01:06:09.230]  these are trade-offs that we have to make as data scientists
[01:06:09.230 --> 01:06:13.350]  of if we are okay with letting a few spam emails through
[01:06:13.350 --> 01:06:15.730]  versus blocking all of our regular emails.
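The trade-off can be illustrated by sliding a decision threshold over made-up spam probabilities. In this sketch "positive" means spam, so a false positive is a ham email that gets blocked and a false negative is a spam email that gets through. The scores and the function name are invented for illustration.

```python
def count_errors(spam_scores, labels, threshold):
    """Return (false_positives, false_negatives) when score >= threshold means spam."""
    fp = sum(1 for s, y in zip(spam_scores, labels)
             if s >= threshold and y == "ham")
    fn = sum(1 for s, y in zip(spam_scores, labels)
             if s < threshold and y == "spam")
    return fp, fn


# Made-up spam probabilities for six emails.
scores = [0.05, 0.30, 0.55, 0.60, 0.80, 0.95]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

for threshold in (0.2, 0.5, 0.9):
    print(threshold, count_errors(scores, labels, threshold))
```

A low threshold blocks more ham, a high threshold lets more spam through, and no setting in this toy example removes both kinds of error at once.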
[01:06:16.790 --> 01:06:20.670]  So he's specifically asking about task 6A.
[01:06:20.670 --> 01:06:22.150]  So if you scroll up a little bit.
[01:06:22.150 --> 01:06:23.330]  Task 6A?
[01:06:23.710 --> 01:06:27.790]  Yeah, I think the answer is a bit of that
[01:06:27.790 --> 01:06:33.650]  and a bit of the question of when you build these models,
[01:06:33.650 --> 01:06:36.170]  the training process literally punishes the model
[01:06:36.170 --> 01:06:38.750]  for getting it wrong every single time.
[01:06:41.510 --> 01:06:43.610]  And in this case...
[01:06:45.730 --> 01:06:50.770]  Yeah, so Chris, you're asking about a false negative.
[01:06:54.010 --> 01:06:58.930]  A false negative is when you've got something that's malicious,
[01:06:58.930 --> 01:07:01.670]  that's misclassified as benign,
[01:07:01.670 --> 01:07:04.070]  which is what the point of the model is,
[01:07:04.070 --> 01:07:08.110]  but what GT brought up is probably more important
[01:07:08.110 --> 01:07:09.530]  is when you have something that's benign
[01:07:09.530 --> 01:07:11.370]  that you classify as malicious,
[01:07:11.370 --> 01:07:14.050]  that is much more common
[01:07:14.050 --> 01:07:19.270]  because you get more benign things on users' devices,
[01:07:19.270 --> 01:07:20.890]  as GT brought up.
[01:07:22.370 --> 01:07:25.450]  So in this case, I don't think this was intentional.
[01:07:26.930 --> 01:07:28.530]  No, this wasn't intentional.
[01:07:28.530 --> 01:07:31.090]  This is just the way that the model learned.
[01:07:32.270 --> 01:07:33.810]  And then as we go through the other models,
[01:07:33.810 --> 01:07:35.810]  you'll see how these shift.
[01:07:36.410 --> 01:07:40.130]  Yeah, but this is honestly the way it looks like.
[01:07:40.130 --> 01:07:43.110]  You've got the ham stuff that are classified as spam
[01:07:43.110 --> 01:07:45.330]  is very low so that all the user experience
[01:07:45.790 --> 01:07:47.470]  isn't impacted that badly.
[01:07:47.510 --> 01:07:49.250]  And the spam stuff that's classified as ham
[01:07:49.250 --> 01:07:51.290]  is relatively high,
[01:07:51.290 --> 01:07:53.230]  but you've got a trade-off
[01:07:53.230 --> 01:07:55.790]  between the false positives and the false negatives.
[01:07:55.810 --> 01:07:56.970]  Yeah, exactly.
[01:07:57.430 --> 01:08:01.010]  And I hope that answered your question, Christopher.
[01:08:01.470 --> 01:08:05.870]  I think going through the other models
[01:08:05.870 --> 01:08:08.390]  will kind of help answer it a little bit more too.
[01:08:10.830 --> 01:08:12.390]  So accuracy, 87%.
[01:08:12.950 --> 01:08:14.630]  87.5%.
[01:08:14.630 --> 01:08:16.390]  94.3%.
[01:08:17.870 --> 01:08:19.610]  In this one, we're going to use
[01:08:20.410 --> 01:08:22.210]  a model that we haven't used before,
[01:08:22.210 --> 01:08:23.450]  logistic regression.
[01:08:23.690 --> 01:08:25.970]  And so with Scikit-learn, it's very easy.
[01:08:25.970 --> 01:08:27.570]  It's the exact same thing.
[01:08:27.570 --> 01:08:29.590]  The difference is we need to
[01:08:32.470 --> 01:08:34.990]  use a LBFGS solver.
[01:08:35.650 --> 01:08:37.770]  So what that's going to look like is
[01:08:39.590 --> 01:08:40.270]  LGSTFIDF
[01:08:40.270 --> 01:08:40.990]  equals
[01:08:42.070 --> 01:08:42.750]  logistic
[01:08:44.210 --> 01:08:45.170]  regression.
[01:08:45.390 --> 01:08:47.390]  Then we're going to set our solver
[01:08:49.010 --> 01:08:50.030]  LBFGS.
[01:08:50.030 --> 01:08:51.710]  And this is just a thing with
[01:08:51.710 --> 01:08:54.110]  logistic regression specifically.
[01:08:54.430 --> 01:08:56.290]  We saw in the slides
[01:08:56.290 --> 01:08:58.150]  that multinomial Naive Bayesian
[01:08:58.150 --> 01:09:00.470]  has a variable called alpha.
[01:09:00.470 --> 01:09:01.650]  We can tweak that.
[01:09:01.650 --> 01:09:03.050]  That's one of our hyperparameters.
[01:09:03.050 --> 01:09:04.990]  This is considered a hyperparameter.
[01:09:04.990 --> 01:09:06.330]  It's the way that logistic regression
[01:09:06.770 --> 01:09:08.090]  interprets the information.
[01:09:08.750 --> 01:09:10.890]  But for this case, we're just going to use LBFGS.
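In scikit-learn both settings are plain constructor arguments. A minimal sketch, with assumed variable names; note that recent scikit-learn releases already use `lbfgs` as the default solver, so passing it explicitly mainly documents the choice.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# The solver is chosen when the model is constructed.
lgs_tfidf = LogisticRegression(solver="lbfgs")

# alpha is the smoothing hyperparameter; 1.0 is scikit-learn's default.
mnb = MultinomialNB(alpha=1.0)

print(lgs_tfidf.get_params()["solver"])
print(mnb.get_params()["alpha"])
```

`get_params()` is a handy way to confirm which hyperparameters a model was actually built with before training it.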
[01:09:11.490 --> 01:09:12.630]  We're going to do the same thing.
[01:09:12.630 --> 01:09:14.990]  We're going to fit LGSTFIDF
[01:09:16.730 --> 01:09:17.090]  dot
[01:09:17.090 --> 01:09:17.770]  fit
[01:09:20.970 --> 01:09:21.330]  TFIDFX
[01:09:22.390 --> 01:09:23.010]  and then
[01:09:23.010 --> 01:09:24.110]  labels.
[01:09:26.570 --> 01:09:28.310]  Let me just hop down here.
[01:09:28.510 --> 01:09:30.310]  We're going to do score equals
[01:09:34.090 --> 01:09:34.450]  LG
[01:09:34.450 --> 01:09:34.930]  S
[01:09:36.450 --> 01:09:37.850]  TFIDF equals
[01:09:38.450 --> 01:09:40.350]  LGSTFIDF dot
[01:09:40.350 --> 01:09:41.110]  score
[01:09:42.950 --> 01:09:44.130]  testTFIDFX
[01:09:44.130 --> 01:09:46.370]  Pay attention to your capitals.
[01:09:46.370 --> 01:09:48.870]  Test labels.
[01:09:48.870 --> 01:09:49.770]  Then we're going to do
[01:09:49.980 --> 01:09:54.710]  predictions
[01:09:54.710 --> 01:09:56.550]  LGSTFIDF
[01:09:56.550 --> 01:09:57.530]  equals
[01:09:58.450 --> 01:09:59.790]  LGSTFIDF
[01:09:59.940 --> 01:10:01.570]  dot predict
[01:10:02.430 --> 01:10:03.470]  and then
[01:10:03.470 --> 01:10:04.830]  our test data
[01:10:07.070 --> 01:10:09.090]  Typing TFIDF so much
[01:10:09.090 --> 01:10:11.750]  it's very confusing for my hands.
[01:10:12.010 --> 01:10:12.710]  C matrix
[01:10:17.850 --> 01:10:18.970]  we're going to use our
[01:10:18.970 --> 01:10:19.910]  confusion matrix
[01:10:20.930 --> 01:10:22.530]  test labels
[01:10:23.050 --> 01:10:25.030]  and predictions
[01:10:26.610 --> 01:10:27.410]  LGSTFIDF
[01:10:27.410 --> 01:10:29.030]  and then our
[01:10:29.030 --> 01:10:30.030]  C report
[01:10:31.770 --> 01:10:32.570]  LGSTFIDF
[01:10:32.570 --> 01:10:33.270]  equals
[01:10:34.770 --> 01:10:35.970]  classification
[01:10:36.570 --> 01:10:37.270]  report
[01:10:38.570 --> 01:10:40.310]  test labels
[01:10:40.310 --> 01:10:41.830]  predictions
[01:10:42.570 --> 01:10:43.910]  TFIDF
[01:10:46.290 --> 01:10:48.070]  That should be good.
[01:10:48.070 --> 01:10:51.010]  Just checking for errors.
[01:10:52.550 --> 01:10:54.710]  That looks right.
[01:10:56.370 --> 01:10:58.710]  So, go ahead
[01:10:58.710 --> 01:11:00.930]  copy this down.
[01:11:00.930 --> 01:11:02.930]  I will sit and
[01:11:02.930 --> 01:11:04.910]  wait for a bit, but you'll notice
[01:11:05.090 --> 01:11:06.930]  a lot of similarities to the multinomial
[01:11:06.930 --> 01:11:08.970]  Naive Bayesian. In fact, we could have just copied and
[01:11:08.970 --> 01:11:11.030]  pasted and just changed it from
[01:11:11.030 --> 01:11:12.590]  LGS to
[01:11:12.590 --> 01:11:14.970]  MNB or from MNB
[01:11:14.970 --> 01:11:16.710]  to LGS and then
[01:11:16.710 --> 01:11:18.990]  change our constructor to be a logistic regression
[01:11:18.990 --> 01:11:21.350]  with the LBFGS solver.
[01:11:24.030 --> 01:11:27.430]  We have our logistic regression constructor.
[01:11:27.430 --> 01:11:29.290]  We train it using the fit function
[01:11:29.290 --> 01:11:31.490]  using our TFIDF vectorized
[01:11:31.490 --> 01:11:33.050]  data and our labels.
[01:11:33.470 --> 01:11:35.550]  We use our test data
[01:11:35.550 --> 01:11:36.970]  and our test labels
[01:11:36.970 --> 01:11:39.250]  to generate a score and to
[01:11:39.250 --> 01:11:40.670]  generate predictions.
[01:11:41.150 --> 01:11:42.990]  And then, off of those predictions, we generate
[01:11:43.510 --> 01:11:45.510]  a confusion matrix and a
[01:11:45.510 --> 01:11:48.470]  classification report.
[01:11:48.470 --> 01:11:49.450]  And then, of course, our nice
[01:11:49.450 --> 01:11:51.490]  little helper function at the bottom that's going to print us out
[01:11:52.390 --> 01:11:54.510]  some helpful stats.
[01:11:54.510 --> 01:11:55.610]  So, I'm just going to let this
[01:11:55.610 --> 01:11:57.330]  sit for a little bit longer.
[01:11:58.930 --> 01:11:59.450]  Kometh,
[01:11:59.450 --> 01:12:01.130]  did Chris ask any more
[01:12:01.130 --> 01:12:03.390]  questions, follow-up, or did we take care
[01:12:03.390 --> 01:12:04.090]  of them?
[01:12:06.010 --> 01:12:07.130]  A bunch of people
[01:12:07.130 --> 01:12:09.250]  piled on and I helped answer the question
[01:12:09.250 --> 01:12:11.150]  because of what he asked.
[01:12:11.450 --> 01:12:13.510]  Honestly, one of the most important things
[01:12:13.510 --> 01:12:15.370]  is the false positive
[01:12:15.370 --> 01:12:17.650]  versus false negative trade-off.
[01:12:17.650 --> 01:12:18.170]  Yeah.
[01:12:20.710 --> 01:12:21.330]  Yeah,
[01:12:21.470 --> 01:12:23.410]  a lot of people have strong opinions about that
[01:12:23.410 --> 01:12:25.670]  because it's a lot of...
[01:12:25.670 --> 01:12:27.710]  It's important. It's definitely
[01:12:27.710 --> 01:12:30.090]  important, especially in data science.
[01:12:30.090 --> 01:12:31.670]  This, I'm...
[01:12:33.030 --> 01:12:33.690]  That might
[01:12:33.690 --> 01:12:34.970]  have been a good talk.
[01:12:35.350 --> 01:12:37.710]  So, here's the interesting thing, is when I
[01:12:37.710 --> 01:12:39.670]  first started out, this
[01:12:39.670 --> 01:12:41.550]  I thought was amazing.
[01:12:41.550 --> 01:12:43.530]  Now, as I've learned
[01:12:43.530 --> 01:12:45.690]  more over time, I kind of feel like, oh, no, this
[01:12:45.690 --> 01:12:47.570]  is a little clunky and a little ham-fisted.
[01:12:47.570 --> 01:12:49.670]  I'm ignoring this, I'm ignoring this, I'm ignoring this,
[01:12:49.670 --> 01:12:51.850]  I'm ignoring this. But this is a great
[01:12:51.850 --> 01:12:53.650]  way for beginners to understand machine
[01:12:53.650 --> 01:12:55.690]  learning. And then from there,
[01:12:55.690 --> 01:12:56.970]  you can kind of hop into
[01:12:58.210 --> 01:12:59.630]  a little bit more of the data
[01:12:59.630 --> 01:13:01.250]  science considerations, like
[01:13:01.250 --> 01:13:03.130]  imbalanced data sets, true
[01:13:03.650 --> 01:13:05.170]  positives, false positives,
[01:13:05.170 --> 01:13:07.310]  and some of those trade-offs,
[01:13:07.310 --> 01:13:10.070]  as well as more hyperparameter tuning.
[01:13:13.790 --> 01:13:15.590]  If you want to look at
[01:13:15.890 --> 01:13:17.910]  a malware, you keep bringing it up,
[01:13:17.910 --> 01:13:18.810]  you can look at the
[01:13:19.970 --> 01:13:21.970]  google Endgame EMBER.
[01:13:21.970 --> 01:13:23.850]  I'll drop a link in the
[01:13:23.850 --> 01:13:26.030]  Twitch chat in a sec. But there's
[01:13:26.150 --> 01:13:28.010]  a malware model that you can look
[01:13:28.010 --> 01:13:29.930]  into, like the open-source code,
[01:13:29.930 --> 01:13:31.670]  and the final thing at the end of that is
[01:13:31.670 --> 01:13:33.570]  it sets the false-positive rate to
[01:13:33.570 --> 01:13:34.850]  0.05%
[01:13:36.310 --> 01:13:38.310]  or 0.5%
[01:13:38.310 --> 01:13:39.930]  at the end, because that's
[01:13:39.930 --> 01:13:41.610]  what you should do in production,
[01:13:41.610 --> 01:13:43.690]  because customers care more about
[01:13:43.690 --> 01:13:45.730]  false positives, because
[01:13:46.590 --> 01:13:47.890]  false positives lead to
[01:13:47.890 --> 01:13:49.930]  alert fatigue, and alert fatigue leads to
[01:13:49.930 --> 01:13:51.870]  just ignoring every single
[01:13:51.870 --> 01:13:53.830]  alert, and then all the false negatives get
[01:13:53.830 --> 01:13:55.930]  through anyway. All the true negatives get through
[01:13:55.930 --> 01:13:56.770]  anyway.
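Fixing a false-positive rate usually means picking the decision threshold from the scores of known-benign samples. The sketch below is a generic illustration of that idea, not code from the EMBER project; the function name, the synthetic scores, and the 0.5% target are all assumptions.

```python
def threshold_for_fpr(benign_scores, target_fpr):
    """Return a threshold whose false-positive rate on benign_scores is <= target_fpr.

    A sample counts as flagged when its score is strictly above the threshold.
    """
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(len(ranked) * target_fpr)   # benign samples we may flag
    allowed = min(allowed, len(ranked) - 1)   # keep the index in range
    return ranked[allowed]


# Synthetic "maliciousness" scores for 1000 benign files.
benign_scores = [i / 1000 for i in range(1000)]
threshold = threshold_for_fpr(benign_scores, 0.005)

flagged = sum(1 for s in benign_scores if s > threshold)
fpr = flagged / len(benign_scores)
print(threshold, flagged, fpr)
```

With the threshold chosen this way, the false-negative rate becomes whatever the model can manage at that operating point, which is the trade-off being described.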
[01:13:57.910 --> 01:13:59.930]  It is something...
[01:14:00.750 --> 01:14:01.910]  Yeah, and even that
[01:14:01.910 --> 01:14:04.030]  model is a little clunky and ham-fisted.
[01:14:05.090 --> 01:14:05.750]  Well, and
[01:14:06.690 --> 01:14:08.050]  it kind of leads to
[01:14:08.050 --> 01:14:09.910]  with the alert fatigue, it kind of leads
[01:14:09.910 --> 01:14:11.690]  to the common thing that we
[01:14:12.570 --> 01:14:13.930]  hear, especially in the technical
[01:14:13.930 --> 01:14:15.990]  community from non-technical people,
[01:14:15.990 --> 01:14:17.770]  of why does my thing not work?
[01:14:18.610 --> 01:14:19.410]  Okay.
[01:14:20.270 --> 01:14:21.990]  So that kind of helps address it.
[01:14:21.990 --> 01:14:24.410]  But yeah, these are all data science considerations.
[01:14:25.410 --> 01:14:25.930]  So I'm
[01:14:25.930 --> 01:14:27.150]  going to go ahead and run this.
[01:14:27.430 --> 01:14:28.650]  Shift-Enter.
[01:14:29.690 --> 01:14:32.070]  And this is going to take a little bit longer than the multinomial
[01:14:32.070 --> 01:14:33.630]  Naive Bayesian, because logistic
[01:14:33.630 --> 01:14:35.830]  regression interprets
[01:14:35.830 --> 01:14:37.750]  the data a little differently, and so it needs to
[01:14:37.750 --> 01:14:40.150]  go through and kind of count things properly.
[01:14:41.650 --> 01:14:42.050]  This
[01:14:42.050 --> 01:14:43.190]  takes, I think, about
[01:14:43.650 --> 01:14:44.810]  30 seconds?
[01:14:46.310 --> 01:14:46.830]  Yeah, it's
[01:14:47.670 --> 01:14:48.670]  not too bad. It's
[01:14:48.670 --> 01:14:50.430]  30 to 60 seconds.
[01:14:51.430 --> 01:14:53.210]  So I'm just going to go ahead and run this.
[01:14:55.210 --> 01:14:56.790]  And then for the logistic
[01:14:56.790 --> 01:14:58.890]  regression count, we're going to do the same thing we did
[01:14:58.890 --> 01:15:00.830]  with the multinomial Naive Bayesian count
[01:15:00.830 --> 01:15:02.490]  and just copy and change words
[01:15:03.830 --> 01:15:04.850]  because I'm lazy.
[01:15:09.390 --> 01:15:10.870]  But yeah, I'm really excited to see that
[01:15:10.870 --> 01:15:12.750]  Twitch chat after I hop off of this
[01:15:12.750 --> 01:15:14.410]  workshop and see just how
[01:15:16.990 --> 01:15:18.310]  passionate people are
[01:15:18.310 --> 01:15:20.750]  about the false positive,
[01:15:20.750 --> 01:15:22.710]  true positive, and XYZ.
[01:15:23.290 --> 01:15:24.730]  Oh, we're going to erase it before
[01:15:24.730 --> 01:15:25.750]  you get off.
[01:15:26.910 --> 01:15:28.250]  Ruin my day.
[01:15:33.270 --> 01:15:35.630]  Yeah, it's taking a little bit longer than I'm used to.
[01:15:37.310 --> 01:15:39.590]  But that's okay. While this is going,
[01:15:39.590 --> 01:15:41.190]  I'm just going to go ahead
[01:15:41.190 --> 01:15:42.410]  and hop into the next bit
[01:15:45.310 --> 01:15:45.830]  and
[01:15:45.830 --> 01:15:47.890]  paste this. All we need to do
[01:15:47.890 --> 01:15:49.190]  is trade this
[01:15:49.190 --> 01:15:51.930]  TF-IDF with count.
[01:15:53.030 --> 01:15:54.110]  So I'm going to go ahead
[01:15:54.110 --> 01:15:55.630]  and copy that. We're going to use the same
[01:15:55.630 --> 01:15:57.910]  constructor and the same solver.
[01:15:57.930 --> 01:15:59.730]  We're going to use count X.
[01:15:59.730 --> 01:16:02.050]  We're going to use count fit.
[01:16:03.630 --> 01:16:04.150]  Count, count,
[01:16:06.090 --> 01:16:06.610]  count, count,
[01:16:07.330 --> 01:16:07.850]  count, count.
[01:16:12.880 --> 01:16:14.860]  Test labels is the same.
[01:16:18.410 --> 01:16:19.050]  These are
[01:16:19.050 --> 01:16:20.070]  predictions.
[01:16:21.010 --> 01:16:22.430]  And just one
[01:16:23.190 --> 01:16:24.890]  quick look over.
[01:16:27.010 --> 01:16:28.830]  Yep, I have everything in.
[01:16:29.430 --> 01:16:31.350]  That's exactly how it's supposed to be.
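The swap described here (same classifier and same calls, with count-based features in place of the TF-IDF ones) can be sketched roughly as follows. This is a minimal illustration on made-up toy emails, not the workbook's code: the variable names (`count_vectorizer`, `lgs_count`, `predictions_count`) and the tiny dataset are assumptions, and the solver is left at scikit-learn's default because the one used in the workbook is not named in this part of the session.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy stand-ins for the workshop's training and test emails (made up).
train_emails = [
    "win free money now",
    "free prize click here",
    "team meeting agenda monday",
    "lunch with team tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_emails = ["free money prize", "team meeting monday"]
test_labels = ["spam", "ham"]

# Learn the vocabulary on the training set only, then reuse it for the test set.
count_vectorizer = CountVectorizer()
train_count = count_vectorizer.fit_transform(train_emails)
test_count = count_vectorizer.transform(test_emails)

# Same kind of classifier as the TF-IDF run; only the input features differ.
lgs_count = LogisticRegression()
lgs_count.fit(train_count, train_labels)

predictions_count = lgs_count.predict(test_count)
print(accuracy_score(test_labels, predictions_count))
```

The test labels stay the same between the TF-IDF and count runs; only the vectorized features and the resulting predictions change.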
[01:16:31.910 --> 01:16:33.230]  Scroll up.
[01:16:33.870 --> 01:16:35.190]  Oh, this is still going.
[01:16:35.190 --> 01:16:35.830]  Look at that.
[01:16:38.990 --> 01:16:39.870]  Weird.
[01:16:41.250 --> 01:16:42.890]  I'm going to let that sit for a bit then.
[01:16:43.270 --> 01:16:45.090]  So this one, I'm going to let you guys write down.
[01:16:45.090 --> 01:16:47.590]  This is task 6D.
[01:16:48.310 --> 01:16:49.030]  So go ahead
[01:16:49.030 --> 01:16:51.030]  and make sure that you have these written down
[01:16:51.030 --> 01:16:53.130]  while the other one is still going.
[01:16:53.130 --> 01:16:55.090]  I'm going to run this and then take a look at
[01:16:57.750 --> 01:16:58.150]  a
[01:16:58.150 --> 01:16:59.530]  logistic regression with the
[01:16:59.530 --> 01:17:05.390]  TF-IDF.
[01:17:05.390 --> 01:17:06.930]  We're actually doing really good
[01:17:06.930 --> 01:17:08.770]  on time. We have one more task
[01:17:08.770 --> 01:17:10.530]  after this. It is
[01:17:10.530 --> 01:17:13.730]  going to be the most challenging task.
[01:17:13.730 --> 01:17:16.090]  But it is going to take
[01:17:16.090 --> 01:17:17.750]  everything from A to Z
[01:17:17.750 --> 01:17:20.170]  and show us what we
[01:17:20.170 --> 01:17:20.990]  learned.
[01:17:23.940 --> 01:17:26.080]  So I just want to check one last
[01:17:26.080 --> 01:17:27.820]  time. And that is
[01:17:27.820 --> 01:17:28.940]  still going.
[01:17:32.740 --> 01:17:34.440]  So I'm going to go ahead and run
[01:17:34.440 --> 01:17:35.880]  this.
[01:17:37.060 --> 01:17:38.520]  This was 6D that I
[01:17:38.520 --> 01:17:40.880]  just clicked to run.
[01:17:44.420 --> 01:17:46.840]  It says the kernel is idle.
[01:17:46.840 --> 01:17:48.020]  That's not good.
[01:17:52.780 --> 01:17:53.540]  Let me see if I can
[01:17:53.540 --> 01:17:54.980]  pause this.
[01:17:59.080 --> 01:18:00.720]  Let me just interrupt it.
[01:18:00.980 --> 01:18:01.920]  Interrupt the kernel.
[01:18:05.540 --> 01:18:06.540]  So I'm going to run this
[01:18:06.540 --> 01:18:07.780]  because that took a second.
[01:18:11.470 --> 01:18:12.330]  Uh oh, I hope I didn't
[01:18:12.330 --> 01:18:13.330]  time out.
[01:18:14.190 --> 01:18:15.650]  That would be very bad.
[01:18:30.950 --> 01:18:32.370]  Let's try and run it again.
[01:18:34.230 --> 01:18:34.670]  Yeah,
[01:18:34.670 --> 01:18:35.690]  that's not good.
[01:18:42.270 --> 01:18:44.150]  Restart and clear output.
[01:18:47.180 --> 01:18:49.620]  So this is going to go all the way to the beginning.
[01:18:51.140 --> 01:18:52.680]  Let's see if I can get this to run.
[01:18:52.680 --> 01:18:54.120]  Okay, so that runs now.
[01:18:55.180 --> 01:18:57.520]  So I ran into a bit of a technical difficulty.
[01:18:57.640 --> 01:18:59.100]  Instead of going through
[01:18:59.100 --> 01:19:00.880]  and filling
[01:19:00.880 --> 01:19:02.860]  everything out, I'm just going to quickly
[01:19:02.860 --> 01:19:04.860]  grab the completed workbook.
[01:19:06.260 --> 01:19:09.700]  And I'm
[01:19:09.700 --> 01:19:10.760]  going to
[01:19:12.380 --> 01:19:13.060]  just
[01:19:13.060 --> 01:19:15.480]  go ahead, run all.
[01:19:17.180 --> 01:19:19.080]  This is going to run
[01:19:19.080 --> 01:19:21.120]  everything for me.
[01:19:22.860 --> 01:19:23.540]  So
[01:19:23.540 --> 01:19:25.380]  while I'm waiting for some of
[01:19:25.380 --> 01:19:27.620]  this stuff to run,
[01:19:27.620 --> 01:19:29.100]  let's see that packages are
[01:19:29.100 --> 01:19:32.920]  being downloaded, tokenization is defined,
[01:19:32.920 --> 01:19:33.580]  everything's good,
[01:19:33.580 --> 01:19:34.980]  everything's good, everything's good.
[01:19:42.490 --> 01:19:44.010]  Where is it stuck now?
[01:19:44.010 --> 01:19:46.790]  It is stuck...
[01:19:50.540 --> 01:19:52.520]  Looks like it's stuck reading the data.
[01:19:53.720 --> 01:19:55.400]  Training spam, training ham.
[01:20:03.010 --> 01:20:05.270]  That one's done, done.
[01:20:08.370 --> 01:20:09.210]  Yeah, when you see
[01:20:09.210 --> 01:20:11.250]  numbers here, that means that that was the order
[01:20:11.250 --> 01:20:13.130]  it was run in. When you see a star,
[01:20:13.130 --> 01:20:14.890]  that means that it's still running.
[01:20:15.170 --> 01:20:17.090]  I'm not sure why the logistic regression
[01:20:17.090 --> 01:20:19.730]  was causing problems,
[01:20:19.730 --> 01:20:21.790]  but let's take
[01:20:21.950 --> 01:20:23.730]  a quick look at...
[01:20:28.190 --> 01:20:29.430]  Oh.
[01:20:30.310 --> 01:20:32.450]  Looks like there's some stuff missing.
[01:20:37.160 --> 01:20:38.820]  Well, that's unfortunate.
[01:20:43.000 --> 01:20:44.220]  So I'm just going to go
[01:20:44.220 --> 01:20:46.120]  ahead and create new code
[01:20:46.120 --> 01:20:48.220]  blocks. I'm just going
[01:20:48.220 --> 01:20:49.820]  to copy the code we had
[01:20:51.820 --> 01:20:54.200]  from task 6D
[01:20:54.200 --> 01:20:55.940]  and 6C.
[01:20:57.080 --> 01:20:58.760]  I apologize for scrolling
[01:20:58.760 --> 01:21:00.220]  so fast.
[01:21:01.300 --> 01:21:02.860]  So let's go ahead
[01:21:02.860 --> 01:21:04.900]  and grab all of that,
[01:21:04.900 --> 01:21:07.040]  paste it in, run it.
[01:21:07.940 --> 01:21:08.900]  Logistic regression
[01:21:08.900 --> 01:21:10.460]  is not defined.
[01:21:12.000 --> 01:21:13.180]  That's weird.
[01:21:17.240 --> 01:21:18.860]  Oh, it's not even in here.
[01:21:18.860 --> 01:21:21.440]  That is something I'm going to have to fix.
[01:21:21.440 --> 01:21:23.400]  I'm not sure... Oh.
[01:21:24.640 --> 01:21:25.920]  Because that's
[01:21:25.920 --> 01:21:27.220]  a whole different workbook.
[01:21:27.220 --> 01:21:29.540]  I need complete spam filter
[01:21:29.540 --> 01:21:32.640]  sklearn. That's the one that I need.
[01:21:33.840 --> 01:21:35.560]  My mistake.
[01:21:36.100 --> 01:21:37.800]  So kernel, restart,
[01:21:37.800 --> 01:21:38.920]  run all.
[01:21:41.880 --> 01:21:43.400]  So this is going to run
[01:21:43.400 --> 01:21:44.800]  all of them.
[01:21:46.000 --> 01:21:47.580]  Let's see where it
[01:21:47.580 --> 01:21:49.100]  gets stuck.
[01:21:49.760 --> 01:21:51.480]  Prints out the emails.
[01:21:52.340 --> 01:21:54.160]  It's training the vectorizers.
[01:21:54.160 --> 01:21:56.780]  I remember that took about a minute or so.
[01:21:59.000 --> 01:22:00.140]  But what we want
[01:22:00.140 --> 01:22:00.820]  to see
[01:22:02.660 --> 01:22:04.780]  after task 6D,
[01:22:04.780 --> 01:22:06.920]  we'll look at 6C and 6D,
[01:22:06.920 --> 01:22:08.240]  is at the end
[01:22:08.240 --> 01:22:09.880]  of this, what we're going to do is
[01:22:09.880 --> 01:22:12.100]  grab a spam
[01:22:12.100 --> 01:22:14.180]  email. This one is actually one that I grabbed
[01:22:14.180 --> 01:22:15.720]  directly from my spam box
[01:22:16.380 --> 01:22:17.680]  last year.
[01:22:18.020 --> 01:22:20.060]  And I like it because we can use
[01:22:20.060 --> 01:22:21.860]  our spam email predictor
[01:22:21.860 --> 01:22:23.680]  to identify whether or not
[01:22:23.680 --> 01:22:25.500]  it's spam. And then we can look and
[01:22:25.500 --> 01:22:26.460]  utilize
[01:22:27.940 --> 01:22:29.100]  any of the models
[01:22:29.100 --> 01:22:31.300]  to really identify that.
[01:22:33.820 --> 01:22:36.060]  So this was supposed to be
[01:22:37.400 --> 01:22:38.320]  a fill-in-the-blank
[01:22:38.320 --> 01:22:39.800]  kind of thing like we've been
[01:22:39.800 --> 01:22:40.980]  going through.
[01:22:41.460 --> 01:22:43.160]  But this last task, it just asks us to
[01:22:43.160 --> 01:22:45.460]  use the vectorizers and the models created to perform
[01:22:45.460 --> 01:22:47.320]  predictions on the test email
[01:22:47.840 --> 01:22:49.640]  that we created up at the top
[01:22:49.640 --> 01:22:51.240]  and the test spam email
[01:22:52.400 --> 01:22:53.640]  which is the real
[01:22:53.640 --> 01:22:55.580]  world email that I pulled from a spam
[01:22:55.580 --> 01:22:56.740]  inbox.
[01:22:57.620 --> 01:22:59.600]  And so what we want to do
[01:22:59.600 --> 01:23:01.640]  is grab that as
[01:23:02.060 --> 01:23:03.340]  a working email.
[01:23:03.340 --> 01:23:05.640]  And this will let us change this to
[01:23:05.640 --> 01:23:07.660]  test email and test spam
[01:23:07.660 --> 01:23:08.660]  email.
[01:23:10.200 --> 01:23:11.640]  Remember, we want to
[01:23:11.640 --> 01:23:13.600]  use the TF-IDF
[01:23:13.600 --> 01:23:15.460]  and the count vectorizers
[01:23:15.460 --> 01:23:17.180]  in order to use
[01:23:18.020 --> 01:23:19.600]  the working email.
[01:23:19.600 --> 01:23:21.320]  We need to use that to preprocess
[01:23:22.100 --> 01:23:23.520]  and because it's
[01:23:23.960 --> 01:23:25.540]  a string, we just wrap it in these
[01:23:25.540 --> 01:23:27.640]  square brackets in order to classify
[01:23:27.640 --> 01:23:29.680]  it as a list because the transform function
[01:23:29.680 --> 01:23:31.140]  expects a list.
[01:23:31.800 --> 01:23:33.860]  We have that as test email,
[01:23:33.860 --> 01:23:35.580]  TF-IDF, and test email
[01:23:35.580 --> 01:23:36.520]  count.
[01:23:37.940 --> 01:23:39.620]  And then we use each of our models
[01:23:39.620 --> 01:23:41.540]  to form a prediction. So we have
[01:23:41.540 --> 01:23:42.500]  the multinomial
[01:23:43.160 --> 01:23:45.640]  Naive Bayesian with TF-IDF, multinomial
[01:23:45.640 --> 01:23:47.480]  with count, logistic regression
[01:23:47.480 --> 01:23:49.740]  TF-IDF, and logistic regression
[01:23:49.740 --> 01:23:51.720]  with count. They all use
[01:23:51.720 --> 01:23:53.880]  the same predict function. They're all using
[01:23:53.880 --> 01:23:56.500]  either the TF-IDF
[01:23:56.500 --> 01:23:57.640]  vectorized version
[01:23:57.640 --> 01:23:59.260]  of the email or the count
[01:23:59.260 --> 01:24:01.540]  vectorized version of the email
[01:24:03.140 --> 01:24:04.060]  based on
[01:24:04.060 --> 01:24:05.560]  whether or not it's the TF-IDF
[01:24:05.560 --> 01:24:07.580]  or the count version of the model.
[01:24:08.820 --> 01:24:09.700]  And then
[01:24:09.700 --> 01:24:13.200]  we make our predictions,
[01:24:13.200 --> 01:24:13.760]  we print
[01:24:13.760 --> 01:24:15.560]  the email itself, and then we can
[01:24:15.560 --> 01:24:17.480]  print what
[01:24:17.480 --> 01:24:19.560]  each model thinks
[01:24:19.560 --> 01:24:20.800]  it is.
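A rough sketch of this last task, under stated assumptions: the toy training emails, the example `working_email` text, and the dictionary layout are made up for illustration, and the workbook's own variable names may differ. The point it shows is that each model must receive features from the vectorizer it was trained with, and that a single string is wrapped in a list before calling `transform`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the workshop dataset (made up).
train_emails = [
    "win free money now",
    "free prize click here",
    "team meeting agenda monday",
    "lunch with team tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]

tfidf_vectorizer = TfidfVectorizer()
count_vectorizer = CountVectorizer()
train_tfidf = tfidf_vectorizer.fit_transform(train_emails)
train_count = count_vectorizer.fit_transform(train_emails)

# Pair every trained model with the vectorizer that produced its features.
models = {
    "MNB + TF-IDF": (MultinomialNB().fit(train_tfidf, train_labels), tfidf_vectorizer),
    "MNB + count": (MultinomialNB().fit(train_count, train_labels), count_vectorizer),
    "LR + TF-IDF": (LogisticRegression().fit(train_tfidf, train_labels), tfidf_vectorizer),
    "LR + count": (LogisticRegression().fit(train_count, train_labels), count_vectorizer),
}

working_email = "click here for your free prize"  # hypothetical new email

predictions = {}
for name, (model, vectorizer) in models.items():
    # transform() expects an iterable of documents, so the single string goes in a list.
    features = vectorizer.transform([working_email])
    predictions[name] = model.predict(features)[0]

print(working_email)
for name, label in predictions.items():
    print(f"{name}: {label}")
```

Swapping `working_email` for a different string is all it takes to classify another message with all four models.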
[01:24:22.660 --> 01:24:23.640]  So if you're
[01:24:23.640 --> 01:24:25.720]  still on the fill in the blank bit,
[01:24:25.720 --> 01:24:27.880]  I encourage you to
[01:24:27.880 --> 01:24:29.940]  fill this out.
[01:24:30.360 --> 01:24:31.580]  If you are using
[01:24:31.580 --> 01:24:33.660]  the completed workbook, like
[01:24:33.660 --> 01:24:35.620]  I am, it's already filled out
[01:24:35.620 --> 01:24:37.520]  and we're just going to go ahead and
[01:24:37.520 --> 01:24:39.360]  let everything else
[01:24:39.360 --> 01:24:41.580]  run in the background until we get
[01:24:41.580 --> 01:24:42.940]  to this point.
[01:24:45.840 --> 01:24:47.800]  So, sorry about that.
[01:24:49.360 --> 01:24:51.520]  In the meantime, are there any
[01:24:51.520 --> 01:24:54.060]  questions, any thoughts,
[01:24:54.060 --> 01:24:55.900]  concerns, ideas?
[01:24:55.900 --> 01:24:57.640]  I'd love to hear from you.
[01:25:15.770 --> 01:25:17.210]  All right.
[01:25:17.630 --> 01:25:18.350]  Psychdude
[01:25:18.350 --> 01:25:20.190]  says, on the completed workbook,
[01:25:20.190 --> 01:25:22.230]  6A shows the
[01:25:22.230 --> 01:25:24.350]  top right and bottom left
[01:25:24.350 --> 01:25:26.510]  numbers swapped in 6A.
[01:25:26.510 --> 01:25:27.790]  Wait a second.
[01:25:27.810 --> 01:25:30.590]  But the one
[01:25:30.590 --> 01:25:32.350]  you were working on earlier
[01:25:32.350 --> 01:25:33.530]  had those flipped.
[01:25:33.530 --> 01:25:36.370]  The fill in the blank and
[01:25:36.370 --> 01:25:38.510]  the completed, were
[01:25:38.510 --> 01:25:40.430]  they different somehow?
[01:25:40.430 --> 01:25:42.370]  Yeah, so I'm
[01:25:42.370 --> 01:25:44.330]  thinking the reason why they're swapped
[01:25:44.330 --> 01:25:45.490]  is because
[01:25:46.510 --> 01:25:47.230]  there was
[01:25:49.450 --> 01:25:50.970]  a way that
[01:25:50.970 --> 01:25:52.570]  Scikit-learn
[01:25:52.570 --> 01:25:54.350]  documented how they did
[01:25:54.350 --> 01:25:56.090]  their confusion matrix versus how they
[01:25:56.090 --> 01:25:57.990]  actually did their confusion matrix,
[01:25:57.990 --> 01:26:00.950]  and it was swapped from what you would normally see.
[01:26:01.370 --> 01:26:02.230]  And so, following
[01:26:02.230 --> 01:26:04.590]  what you would normally see is what I did,
[01:26:04.590 --> 01:26:06.610]  and that's where the mistake came in.
[01:26:06.610 --> 01:26:08.650]  And I can actually show you that.
[01:26:08.950 --> 01:26:10.410]  I was hoping nobody noticed, but
[01:26:10.410 --> 01:26:11.410]  good eye.
[01:26:12.930 --> 01:26:14.210]  So if you look at 6,
[01:26:14.210 --> 01:26:16.310]  actually before we go there, let's take a look
[01:26:16.310 --> 01:26:18.090]  and see the count.
[01:26:18.090 --> 01:26:20.190]  So we have 590 ham emails in our
[01:26:20.190 --> 01:26:21.950]  test dataset, 276
[01:26:21.950 --> 01:26:23.590]  spam emails in our
[01:26:23.590 --> 01:26:25.790]  test dataset. And so if we
[01:26:25.790 --> 01:26:28.310]  look at 6A,
[01:26:28.310 --> 01:26:29.750]  we see 590 ham
[01:26:29.750 --> 01:26:31.630]  emails, but
[01:26:31.630 --> 01:26:33.770]  as far as actual label, we
[01:26:33.770 --> 01:26:35.650]  should see in the column for spam
[01:26:35.970 --> 01:26:37.330]  a total of the
[01:26:37.950 --> 01:26:39.890]  276. We don't see
[01:26:39.890 --> 01:26:41.650]  that. And the reason why
[01:26:41.650 --> 01:26:43.670]  is because this is supposed to be
[01:26:43.670 --> 01:26:45.350]  the actual, see
[01:26:45.350 --> 01:26:48.790]  108 plus 168 is 276,
[01:26:48.790 --> 01:26:49.670]  and then
[01:26:49.670 --> 01:26:51.750]  this was supposed to be the predicted.
[01:26:51.790 --> 01:26:53.030]  The problem is
[01:26:53.410 --> 01:26:54.730]  the way that scikit-learn
[01:26:55.630 --> 01:26:57.730]  swapped them,
[01:26:57.730 --> 01:26:59.650]  and I guess I did not swap
[01:26:59.650 --> 01:27:00.750]  them back.
[01:27:02.290 --> 01:27:03.630]  To get the
[01:27:03.630 --> 01:27:05.530]  correct result, you just need to switch the
[01:27:05.530 --> 01:27:07.750]  test labels and the predictions
[01:27:07.750 --> 01:27:09.930]  MNB on both sides
[01:27:09.930 --> 01:27:11.770]  for the confusion matrix
[01:27:11.770 --> 01:27:13.630]  and the classification report.
[01:27:14.630 --> 01:27:15.490]  So, yeah.
[01:27:15.490 --> 01:27:17.350]  That's why they seem swapped.
[01:27:17.350 --> 01:27:19.210]  The one on the completed workbook
[01:27:19.210 --> 01:27:21.230]  is correct, because I believe
[01:27:21.230 --> 01:27:22.490]  the completed workbook
[01:27:23.590 --> 01:27:25.090]  shows in the actual label
[01:27:25.090 --> 01:27:27.210]  column the spam emails, and if you total
[01:27:27.210 --> 01:27:29.650]  the two numbers, you get 276.
[01:27:29.670 --> 01:27:31.510]  So, good eye on that.
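The effect of argument order can be checked on a small made-up example. In scikit-learn, `confusion_matrix(y_true, y_pred)` puts the true labels on the rows and the predicted labels on the columns, so passing the two arguments in the opposite order yields the transposed matrix. The labels below are invented for illustration; which order each workbook actually used is not visible in the recording.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 ham and 3 spam emails in the "test set".
test_labels = ["ham", "ham", "ham", "ham", "spam", "spam", "spam"]
predictions = ["ham", "ham", "ham", "spam", "ham", "ham", "spam"]
labels = ["ham", "spam"]

# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(test_labels, predictions, labels=labels)

# Arguments swapped: rows are now predicted labels, i.e. the transpose.
cm_swapped = confusion_matrix(predictions, test_labels, labels=labels)

print(cm)
print(cm_swapped)

# Row totals give the per-class counts of whatever was passed first.
print(cm.sum(axis=1))          # true ham / true spam counts
print(cm_swapped.sum(axis=1))  # predicted ham / predicted spam counts
```

A quick sanity check like the one in the session applies here too: the totals along the "actual" axis should match the known number of ham and spam emails in the test set.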
[01:27:34.980 --> 01:27:36.740]  Go ahead and check this.
[01:27:36.980 --> 01:27:37.760]  Perfect.
[01:27:41.220 --> 01:27:43.200]  Before we ran into the problems,
[01:27:43.200 --> 01:27:44.140]  we're going to look at
[01:27:44.740 --> 01:27:47.000]  6C and 6D with a logistic
[01:27:47.000 --> 01:27:48.100]  regression.
[01:27:48.720 --> 01:27:51.020]  Remember, we had 0.85,
[01:27:52.000 --> 01:27:53.320]  or 87.5%
[01:27:53.320 --> 01:27:55.260]  accuracy
[01:27:55.260 --> 01:27:56.660]  for the TF-IDF
[01:27:56.660 --> 01:27:58.440]  multinomial Naive Bayesian.
[01:27:58.440 --> 01:28:00.900]  For the count multinomial Naive Bayesian,
[01:28:00.900 --> 01:28:02.580]  we have 0.94.
[01:28:03.480 --> 01:28:05.180]  For the logistic regression,
[01:28:05.180 --> 01:28:06.320]  we get a little bit better.
[01:28:06.320 --> 01:28:08.240]  We're at 0.959
[01:28:08.700 --> 01:28:10.300]  as our accuracy.
[01:28:10.580 --> 01:28:12.680]  We have fewer
[01:28:12.680 --> 01:28:14.940]  ham emails being classified as
[01:28:14.940 --> 01:28:16.560]  spam and fewer
[01:28:16.560 --> 01:28:19.260]  spam emails being classified as ham.
[01:28:19.260 --> 01:28:21.040]  We're getting better.
[01:28:23.060 --> 01:28:24.340]  With the
[01:28:24.940 --> 01:28:26.820]  task 6D, this is logistic
[01:28:26.820 --> 01:28:29.320]  regression with count vectorizer.
[01:28:29.520 --> 01:28:30.780]  This is going to be our
[01:28:30.780 --> 01:28:34.040]  best model at 0.976.
[01:28:34.800 --> 01:28:37.020]  Here we only have 8 of the
[01:28:37.020 --> 01:28:38.740]  ham emails being classified as spam
[01:28:38.740 --> 01:28:40.900]  and 12 of the spam emails being
[01:28:40.900 --> 01:28:42.420]  classified as ham.
[01:28:42.420 --> 01:28:44.320]  Of the four models that we
[01:28:44.320 --> 01:28:46.560]  trained, this
[01:28:46.560 --> 01:28:47.480]  is the best one
[01:28:48.640 --> 01:28:50.340]  in terms of accuracy.
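The accuracy figure can be reproduced from the counts mentioned in the session: 590 ham and 276 spam emails in the test set, with 8 ham emails flagged as spam and 12 spam emails flagged as ham. The variable names below are just for illustration.

```python
# Counts quoted in the session for the test set and the task 6D errors.
ham_total = 590
spam_total = 276
ham_as_spam = 8    # ham emails misclassified as spam
spam_as_ham = 12   # spam emails misclassified as ham

ham_correct = ham_total - ham_as_spam
spam_correct = spam_total - spam_as_ham

# Accuracy is the share of all test emails that were classified correctly.
accuracy = (ham_correct + spam_correct) / (ham_total + spam_total)
print(round(accuracy, 4))
```

That works out to 846 correct out of 866, or roughly 0.977, which lines up with the 0.976 shown for the logistic regression with the count vectorizer.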
[01:28:51.320 --> 01:28:52.780]  This is why we want to try
[01:28:52.780 --> 01:28:54.560]  different algorithms, different machine
[01:28:54.560 --> 01:28:56.400]  learning algorithms, as well as different ways
[01:28:56.400 --> 01:28:58.940]  that we preprocess or vectorize our data.
[01:28:59.520 --> 01:29:00.660]  That way we can create
[01:29:00.660 --> 01:29:02.380]  different models, see which ones perform
[01:29:02.380 --> 01:29:04.380]  the best, and then we can gravitate
[01:29:04.380 --> 01:29:05.620]  towards that one.
[01:29:09.800 --> 01:29:10.480]  As far
[01:29:10.480 --> 01:29:11.020]  as the
[01:29:11.020 --> 01:29:13.000]  test spam email,
[01:29:13.000 --> 01:29:15.300]  this is a brand new email.
[01:29:15.400 --> 01:29:17.060]  It comes in on our email server.
[01:29:17.060 --> 01:29:18.800]  We want to use our models to
[01:29:18.800 --> 01:29:20.380]  classify it.
[01:29:22.480 --> 01:29:25.140]  Let's go ahead and make a prediction.
[01:29:26.140 --> 01:29:28.380]  As I went over earlier, we're going to have
[01:29:28.380 --> 01:29:30.540]  our test spam email, which is called working email,
[01:29:30.540 --> 01:29:32.440]  so we can swap this out pretty
[01:29:32.440 --> 01:29:34.440]  seamlessly. We're going to use our
[01:29:34.440 --> 01:29:36.140]  vectorizers that we already defined
[01:29:36.140 --> 01:29:38.460]  to transform our working
[01:29:38.460 --> 01:29:40.740]  email. And then of course, being a
[01:29:40.740 --> 01:29:42.420]  string, we need to put it in a list.
[01:29:42.420 --> 01:29:44.740]  That's what the square brackets are for.
[01:29:44.740 --> 01:29:47.380]  Then we use all four of our models,
[01:29:47.380 --> 01:29:49.300]  the multinomial and the logistic
[01:29:49.300 --> 01:29:51.260]  regression with the TF-IDF and the count
[01:29:51.260 --> 01:29:54.900]  vectorizers. We use the predict function
[01:29:54.900 --> 01:29:58.080]  on the vectorized
[01:29:58.080 --> 01:29:58.280]  emails.
[01:29:58.280 --> 01:29:59.260]  And then we just go
[01:29:59.260 --> 01:30:02.120]  ahead and print out the working email itself.
[01:30:02.120 --> 01:30:02.420]  "Your latest issue is available
[01:30:02.420 --> 01:30:04.540]  now. If you don't want issue notifications,
[01:30:04.540 --> 01:30:05.920]  click here to unsubscribe."
[01:30:07.440 --> 01:30:08.240]  And then
[01:30:08.240 --> 01:30:10.580]  we print out the values
[01:30:11.180 --> 01:30:12.160]  that are predicted
[01:30:12.160 --> 01:30:13.800]  from each of our models.
[01:30:14.140 --> 01:30:16.480]  If we look at it, we're going to see something interesting.
[01:30:20.020 --> 01:30:20.980]  Our best
[01:30:20.980 --> 01:30:22.400]  performing model, which is the
[01:30:22.400 --> 01:30:25.080]  logistic regression with the count vectorizer,
[01:30:25.080 --> 01:30:26.760]  correctly classified this as a
[01:30:26.760 --> 01:30:28.900]  spam email. But our worst
[01:30:28.900 --> 01:30:30.460]  performing model, even though it still had an
[01:30:30.460 --> 01:30:32.480]  87.5% accuracy,
[01:30:33.040 --> 01:30:34.300]  our multinomial Naive Bayesian
[01:30:34.300 --> 01:30:36.540]  with TF-IDF incorrectly classified
[01:30:36.540 --> 01:30:38.660]  this as a ham email.
[01:30:39.620 --> 01:30:40.440]  And so this is also
[01:30:40.440 --> 01:30:41.380]  why we want to
[01:30:42.020 --> 01:30:44.280]  try building different models and get things that
[01:30:44.280 --> 01:30:45.760]  predict much better.
[01:30:51.020 --> 01:30:52.440]  Let me know if there's any
[01:30:52.440 --> 01:30:53.700]  questions on that.
[01:30:54.220 --> 01:30:56.380]  We are short on time. We're actually running
[01:30:56.500 --> 01:30:58.460]  a little over, but before I let you go, there's one more
[01:30:58.460 --> 01:31:00.740]  thing I want to show you. And that is
[01:31:00.740 --> 01:31:02.460]  the hyperparameter
[01:31:02.460 --> 01:31:04.420]  tuning. So if we
[01:31:04.420 --> 01:31:06.620]  jump up to that worst performing model,
[01:31:06.620 --> 01:31:08.100]  that 0.85,
[01:31:08.100 --> 01:31:10.660]  the multinomial with TF-IDF,
[01:31:11.580 --> 01:31:12.180]  sorry,
[01:31:12.180 --> 01:31:13.940]  0.875.
[01:31:15.160 --> 01:31:16.400]  And we use that
[01:31:16.400 --> 01:31:18.600]  alpha that we saw in our
[01:31:18.600 --> 01:31:20.500]  formulas on the slides.
[01:31:21.320 --> 01:31:23.040]  Alpha is set to 1.
[01:31:23.040 --> 01:31:24.980]  We can go ahead and run that.
[01:31:25.060 --> 01:31:25.940]  And we see that we're at
[01:31:25.940 --> 01:31:27.560]  0.875.
[01:31:28.060 --> 01:31:29.960]  But what if we shrink that alpha
[01:31:29.960 --> 01:31:31.740]  to 0.1?
[01:31:31.740 --> 01:31:34.120]  This is hyperparameter tuning. We can change
[01:31:34.120 --> 01:31:34.840]  it.
[01:31:35.780 --> 01:31:37.920]  Now our accuracy changed from
[01:31:37.920 --> 01:31:39.680]  one of the worst models that we have
[01:31:39.680 --> 01:31:42.320]  to one of the better models that we have.
[01:31:42.640 --> 01:31:44.480]  In fact, this is second place.
[01:31:44.660 --> 01:31:46.900]  So we have 0.965
[01:31:46.900 --> 01:31:48.060]  accuracy.
[01:31:49.480 --> 01:31:50.000]  And so you can
[01:31:50.000 --> 01:31:51.860]  see how hyperparameter tuning can
[01:31:51.860 --> 01:31:54.100]  really help out in tweaking
[01:31:54.100 --> 01:31:55.960]  the way that our models operate.
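A minimal sketch of that alpha experiment, assuming a TF-IDF vectorizer feeding scikit-learn's `MultinomialNB`, whose `alpha` parameter controls additive smoothing. The toy emails and variable names are made up; on a dataset this small both settings may well score the same, so the jump from 0.875 to 0.965 described in the session should only be expected on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the workshop's training and test emails (made up).
train_emails = [
    "win free money now",
    "free prize click here",
    "team meeting agenda monday",
    "lunch with team tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]
test_emails = ["free money prize", "team meeting monday"]
test_labels = ["spam", "ham"]

tfidf_vectorizer = TfidfVectorizer()
train_tfidf = tfidf_vectorizer.fit_transform(train_emails)
test_tfidf = tfidf_vectorizer.transform(test_emails)

# Fit one model per candidate alpha and record its accuracy.
accuracy_by_alpha = {}
for alpha in (1.0, 0.1):
    mnb_tfidf = MultinomialNB(alpha=alpha)
    mnb_tfidf.fit(train_tfidf, train_labels)
    predictions = mnb_tfidf.predict(test_tfidf)
    accuracy_by_alpha[alpha] = accuracy_score(test_labels, predictions)

for alpha, accuracy in accuracy_by_alpha.items():
    print(f"alpha={alpha}: accuracy={accuracy:.3f}")
```

In practice, candidate values are usually compared on a separate validation split or with cross-validation rather than on the test set, so the final test score stays an honest estimate.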
[01:32:03.130 --> 01:32:04.870]  Before I let you go,
[01:32:08.830 --> 01:32:09.830]  here's some references
[01:32:10.290 --> 01:32:12.550]  for more information.
[01:32:12.770 --> 01:32:14.050]  And if you want the slides to this
[01:32:14.050 --> 01:32:16.290]  and more information from me
[01:32:16.290 --> 01:32:18.530]  and the GitHub
[01:32:18.530 --> 01:32:20.450]  project that this is all part of,
[01:32:20.450 --> 01:32:22.430]  check out github.com/NetsecExplained,
[01:32:22.430 --> 01:32:24.450]  and you can shoot me an email with any
[01:32:24.450 --> 01:32:26.710]  questions, thoughts, comments
[01:32:26.710 --> 01:32:28.370]  at gtklondike
[01:32:28.370 --> 01:32:29.930]  at gmail.com.
[01:32:32.310 --> 01:32:34.210]  So that's all I have for you guys.
[01:32:34.210 --> 01:32:36.310]  If there's any discussion,
[01:32:36.310 --> 01:32:38.270]  TAs, do we have a separate room for the workshops
[01:32:38.270 --> 01:32:40.370]  or are we just going to use this
[01:32:40.370 --> 01:32:40.750]  Discord?
[01:32:41.570 --> 01:32:44.310]  Nope, there is a separate room for
[01:32:44.310 --> 01:32:45.590]  the workshops here. Let me
[01:32:46.230 --> 01:32:49.010]  post that in Twitch.
[01:32:51.430 --> 01:32:51.910]  Um...
[01:32:53.190 --> 01:32:55.090]  There you go, folks. If you have any
[01:32:55.090 --> 01:32:57.110]  other questions or want to continue talking
[01:32:57.110 --> 01:32:58.990]  about this workshop, you can find us
[01:32:58.990 --> 01:33:01.130]  at that channel
[01:33:01.130 --> 01:33:02.990]  in the AI Village
[01:33:02.990 --> 01:33:05.170]  in the DC Discord.
[01:33:06.070 --> 01:33:07.030]  Perfect.
[01:33:07.670 --> 01:33:08.890]  And then I'm going to go ahead
[01:33:08.890 --> 01:33:11.010]  and end the stream. I'm going to hop into
[01:33:11.010 --> 01:33:13.690]  that channel and I will see you guys there.
[01:33:15.110 --> 01:33:17.370]  Thanks a lot, Gavin. Absolutely.
[01:33:17.370 --> 01:33:18.230]  Thank you.
